

S. Valavanis 

Published on demand by 

University Microfilms Limited, High Wycomb, England 
A Xerox Company, Ann Arbor, Michigan, U.S.A. 














Digitized by the Internet Archive 
in 2013 

This is an authorized facsimile of the original book, and was produced in 1974 by microfilm-xerography by Xerox University Microfilms, Ann Arbor, Michigan, U.S.A.


ECONOMETRICS

An Introduction to Maximum Likelihood Methods

STEFAN VALAVANIS
Assistant Professor of Economics
Harvard University, 1956 to 1958


New York Toronto London 





Advisory Committee: Edward H. Chamberlin, Gottfried Haberler, Alvin H. Hansen, Edward S. Mason, and John H. Williams, all of Harvard University.

Burns • Social Security and Public Policy 

Duesenberry • Business Cycles and Economic Growth 

Hansen • The American Economy 

Hansen • A Guide to Keynes 

Hansen • Monetary Theory and Fiscal Policy 

Harris • International and Interregional Economics 

Henderson and Quandt • Microeconomic Theory 

Hoover • The Location of Economic Activity 

Kindleberger • Economic Development 

Lerner • Economics of Employment 

Valavanis • Econometrics 


ECONOMETRICS. Copyright © 1959 by the McGraw-Hill Book Company, Inc. Printed in the United States of America. All rights reserved. This book, or parts thereof, may not be reproduced in any form without permission of the publishers. Library of Congress Catalog Card Number 58-14363


Editor's introduction 

For years many teachers of economics and other professional economists have felt the need of a series of books on economic subjects which is not filled by the usual textbook, nor by the highly technical treatise.

This present series, published under the general title of Economics 
Handbook Series, was planned with these needs in mind. Designed 
first of all for students, the volumes are useful in the ever-growing field 
of adult education and also are of interest to the informed general reader.
The volumes present a distillate of accepted theory and practice, 
without the detailed approach of the technical treatise. Each volume 
is a unit, standing on its own. 

The authors are scholars, each writing on an economic subject on which he is an authority. In this series the author's first task was not to make important contributions to knowledge — although many of them do — but so to present his subject matter that his work as a scholar will carry its maximum influence outside as well as inside the classroom. The time has come to redress the balance between the energies spent on the creation of new ideas and on their dissemination. Economic ideas are unproductive if they do not spread beyond the world of scholars. Popularizers without technical competence, unqualified textbook writers, and sometimes even charlatans control too large a part of the market for economic ideas.

In the classroom the Economics Handbook Series will serve, it is hoped, as brief surveys in one-semester courses, as supplementary reading in introductory courses, and in other courses in which the subject is related.

Seymour E. Harris 



Editor's preface 

The editor welcomes Stefan Valavanis' study of econometrics into 
the Economics Handbook Series as a unique contribution to eco- 
nometrics and to the teaching of the subject. 

Anyone who reads this book will understand the tragedy of the death of Stefan Valavanis. He was brilliant, imaginative, and a first-class scholar and teacher, and his death is a great loss to the world of economics.
Professor Valavanis had virtually completed his book just before his departure for Europe in the summer of 1958. But, as is always true of a manuscript left with the publisher, though it was essentially complete much remained to be done. My colleague, Professor Alfred H. Conrad, volunteered to finish the job. Unselfishly he put the final touches on the book, went over the manuscript, checked the mathematics, assumed the responsibility for seeing it through the press, and helped in many other ways. Without his help, the problem of publication would have been a serious one. The publisher and editor are indeed grateful.

This book is an introduction to econometrics, that is, to the tech- 
niques by which economic theories are brought into contact with the 
facts. While not in any sense a "cookbook," its orientation is 
constantly toward the strategy of economic research. Within the 
field of econometrics, the book is primarily addressed to the problems 
of estimation rather than to the testing of hypotheses. It is concerned 
with estimating, from the insufficient information available, the values or magnitudes of the variables and relationships suggested by economic
analysis. The maximum likelihood and limited information tech- 
niques are developed from fundamental assumptions and criteria and 
demonstrated by example; their costs in accuracy and computation 
are weighed. There are short but careful treatments of identification, 
instrumental variables, factor analysis, and hypothesis testing. The 
book proceeds much more by statements of problems and examples than 
by the development of mathematical proofs. 

The main feature of this book is its pedagogical strength. While 
rigor is not sacrificed and no mathematical or statistical rabbits are 
pulled out of the author's hat, the statistical tools are always presented 
in terms of the fundamental limitations and criteria of the real world. 
Almost every concept is introduced by an example set in this world of 
real problems and difficulties. Mathematical concepts and notational 
distinctions are most often handled in clearly set off "digressions." 
The fundamental notions of probability and matrix algebra are 
reviewed, but in general it is assumed that the student has 
already been introduced to determinants and matrices and the 
elementary properties and processes of differentiation. (No more 
knowledge of mathematics is required than for any of the other 
comparable texts, and, thanks to the pedagogical skills of the author, 
probably considerably less.) Frequent emphasis is placed upon com- 
putation design and requirements. 

Valavanis' book is brilliantly organized for classroom presentation, 
most of the statistical and mathematical assumptions and concepts 
being treated verbally and by example before they appear in any 
mathematical formulation. In addition to the examples used in 
presentation, there are exercises in almost every chapter. 

Seymour E. Harris 


Preface

This work is neither a complete nor a systematic treatment of econometrics. It definitely is not empirical. It has one unifying idea: to reduce to common-sense terms the mathematical statistics on which the theory of econometrics rests.

If anything in econometrics (or in any other field) makes sense, one ought to be able to put it into words. The result may not be so compact as a close-knit mathematical exposition, but it can be, in its own way, just as elegant and clear.

Putting symbols and jargon into words understandable to a wider 
audience is not the only thing I want to do. I think that watering 
down a highly refined or a very deep mathematical argument is a 
useful activity. For instance, if the essence of a problem can be 
captured by two variables, why tackle n? Or why worry about 
mathematical continuity, existence, and singularity in a discussion of 
economic matters, unless these intriguing properties have interesting 
economic counterparts? We would be misspending effort if all the 
reader wants is an intelligent layman's idea of what is going on in the 
field of econometrics. For the sake of the punctilious, I shall give 
warning every time my heuristic "proof" is not watertight or when- 
ever I slur over an unessential mathematical maze. 

Much of econometric literature suffers from overfancy notation. If 
I judge rightly, many people quake at the sight of every new issue of 
Econometrica. I hope to show them the intuitive good sense that hides 
behind the mathematical squiggles. 



Besides restoring the self-assurance of the ordinary intelligent 
reader and helping him discriminate between really important develop- 
ments in econometric method and mere mathematical quibbles, I have 
tried to be useful to the teachers of econometrics and peripheral 
subjects by supplying them with material in "pedagogic" form. And 
lastly, I should like to amuse and surprise the serious or expert eco- 
nometrician, the connoisseur, by serving him familiar hash in nice new 
palatable ways but without loss of nutrient substance. 

The gaps in this work are intentional. One cannot, from this book 
alone, learn econometrics from the ground up; one must pick up 
elementary statistical notions, algebra, a little calculus — even some 
econometrics — elsewhere. 

For the beginner in econometrics, an approximately correct sequence 
would be the books of Beach (1957), Tinbergen (1951), Klein (1953), 
and Hood (1953); with Tintner (1952) as a source of examples or
museum for the numerous varieties of quantitative techniques in 
existence. Tinbergen emphasizes economic policy; Klein, the busi- 
ness cycle and macro-economics; Tintner, the testing of hypotheses 
and the analysis of time series. All three use interesting empirical 
examples. For elementary mathematics the first part of Beach 
(perhaps also Klein, appendix on matrices) is enough. Reference to 
all these easily available and digestible texts is meant to avoid my 
repeating what has been said by others. 

From time to time, however, I make certain "digressions"; these 
are held in from the margins. These digressions have to do mostly 
with mathematical and statistical subjects that in my opinion are 
either inaccessible or not well explained elsewhere. 

Stefan Valavanis 


Harvard University, for grants from the Joseph H. Clark Bequest and 
from the Ford Foundation's Small Research Fund in the Department 
of Economics. 

Professor and Mrs. William Jaffé

Professor Arthur S. Goldberger

Professor Richard E. Quandt 

Professor Frederick F. Stephan 


Contents

Editor's introduction v

Editor's preface vii

Preface ix

Digressions xv

Frequent references xvii

Chapter 1. The fundamental proposition of econometrics 1

1.1. What econometrics is about . 1 

1.2. Mathematical tools 2 

1.3. Outline of procedure and main discoveries in the next hundred pages 3 

1.4. All-importance of statistical assumptions 4 

1.5. Rationalization of the error term 5 

1.6. The fundamental proposition 6 

1.7. Population and sample 7

1.8. Parameters and estimates 8 

1.9. Assumptions about the error term 9 

1. u is a random real variable 9

2. u_t, for every t, has zero expected value 10

3. The variance of u_t is constant over time 13

4. The error term is normally distributed 14 

5. The random terms of different time periods are independent . . 16 

6. The error is not correlated with any predetermined variable . . 16 

1.10. Mathematical restatement of the Six Simplifying Assumptions . . 17 

1.11. Interpretation of additivity 18 

1.12. Recapitulation 18 

Further readings 21 



Chapter 2. Estimating criteria and the method of least squares . 23 

2.1. Outline of the chapter 23 

2.2. Probability and likelihood 24 

2.3. The concept of likelihood function 28 

2.4. The form of the likelihood function 29 

2.5. Justification of the least squares technique 31 

2.6. Generalized least squares 33

2.7. The meaning of unbiasedness 35 

2.8. Variance of the estimate 39 

2.9. Estimates of the variance of the estimate 41 

2.10. Estimates ad nauseam 43

2.11. The meaning of consistency 43 

2.12. The merits of unbiasedness and consistency 45 

2.13. Other estimating criteria 47 

2.14. Least squares and the criteria 47 

2.15. Treatment of heteroskedasticity 48 

Further readings 50 

Chapter 3. Bias in models of decay 52 

3.1. Introduction and summary 52 

3.2. Violation of Simplifying Assumption 6 53 

3.3. Conjugate samples 54 

3.4. Source of bias 57 

3.5. Extent of the bias 58 

3.6. The nature of initial conditions 58 

3.7. Unbiased estimation 60

Further readings 62

Chapter 4. Pitfalls of simultaneous interdependence 63 

4.1. Simultaneous interdependence 63 

4.2. Exogenous variables 64 

4.3. Haavelmo's proposition. 64 

4.4. Simultaneous estimation 67 

4.5. Generalization of the results 70 

4.6. Bias in the secular consumption function 71 

Further readings 71 

Chapter 5. Many-equation linear models 73 

5.1. Outline of the chapter 73 

5.2. Effort-saving notation 74 

5.3. The Six Simplifying Assumptions generalized 77 

5.4. Stochastic independence 78 

5.5. Interdependence of the estimates 82 

5.6. Recursive models 83 

Further readings 84 

Chapter 6. Identification 85 

6.1. Introduction 85 

6.2. Completeness and nonsingularity 86 


6.3. The reduced form 87 

6.4. Over- and underdeterminacy 88

6.5. Bogus structural equations. 89 

6.6. Three definitions of exact identification 90 

6.7. A priori constraints on the parameters 91 

6.8. Constraints on parameter estimates . 94 

6.9. Constraints on the stochastic assumptions 95 

6.10. Identifiable parameters in an underidentified equation 96

6.11. Source of ambiguity in overidentified models 98 

6.12. Identification and the parameter space 100 

6.13. Over- and underidentification contrasted . 102 

6.14. Confluence 103 

Further readings 106 

Chapter 7. Instrumental variables 107 

7.1. Terminology and results 107 

7.2. The rationale of estimating parametric relationships 108 

7.3. A single instrumental variable 110 

7.4. Connection with the reduced form 113 

7.5. Properties of the instrumental variable technique in the simplest case 114 

7.6. Extensions 115 

7.7. How to select instrumental variables 116 

Chapter 8. Limited information 118 

8.1. Introduction 118 

8.2. The chain of causation 119 

8.3. The rationale of limited information . . . 121 

8.4. Formulas for limited information 123 

8.5. Connection with the instrumental variable method. . . . . . 125 

8.6. Connection with indirect least squares 125 

Further readings 125 

Chapter 9. The family of simultaneous estimating techniques . . 126 

9.1. Introduction . 126 

9.2. Theil's method of dual reduced forms 126 

9.3. Treatment of models that are not exactly identified 128 

9.4. The "natural state" of an econometric model 130 

9.5. What are good forecasts? 132

Further readings. 134 

Chapter 10. Searching for hypotheses and testing them 136

10.1. Introduction 136 

10.2. Discontinuous hypotheses 137 

10.3. The null hypothesis 138 

10.4. Examples of rival hypotheses 138

10.5. Linear confluence 142

10.6. Partial correlation 144 

10.7. Standardized variables 145 

10.8. Bunch map analysis 146 


10.9. Testing for linearity 150 

10.10. Linear versus ratio models 152 

10.11. Split sectors versus sector variable 153 

10.12. How hypotheses are chosen 154 

Further readings 155 

Chapter 11. Unspecified factors 156

11.1. Reasons for unspecified factor analysis 156

11.2. A single unspecified variable 157 

11.3. Several unspecified variables 159 

11.4. Linear orthogonal factor analysis 160

11.5. Testing orthogonality 163 

11.6. Factor analysis and variance analysis 164 

Further readings 165 

Chapter 12. Time series , 166 

12.1. Introduction , 166 

12.2. The time interval 167 

12.3. Treatment of serial correlation 168 

12.4. Linear systems 171

12.5. Fluctuations in trendless time series 172 

12.6. Correlograms and kindred charts 174 

12.7. Seasonal variation 176 

12.8. Removing the trend. 178 

12.9. How not to analyze time series 181 

12.10. Several variables and time series 182 

12.11. Time series generated by structural models 183

12.12. The over-all autoregression of the economy 185

12.13. Leading indicators . 186 

12.14. The diffusion index . 188 

12.15. Abuse of long-term series 190 

12.16. Abuse of coverage 191 

12.17. Disagreements between cross-section and time series estimates . . 192 
Further readings 195 

Appendix A. Layout of computations 197 

The rules in detail 198 

Matrix inversion . 201 

B. Stepwise least squares 203 

C. Subsample variances as estimators 207 

D. Proof of least squares bias in models of decay .... 209 

E. Completeness and stochastic independence 213 

F. The asterisk notation. 214 

Index 217 


Digressions

On the distinction between probability and probability density . . . 9

On the distinction between "population" and "universe" .... 11 

On infinite variances 14 

On the univariate normal distribution 14 

On the differences among moment, expectation, and covariance . . . 13

On the multivariate normal distribution 26 

On computational arrangement 32 

On matrices of moments and their determinants 34 

On notation 44 

On arbitrary weights 50 

On directional least squares 69 

On Jacobians . 79 

On the etymology of the term "multicollinearity" 105 

On correlation and kindred concepts . 141 

On moving averages and sums 188 


Frequent references 

Names in small capital letters refer to the following works: 

Allen  R. G. D. Allen, Mathematical Economics. New York: St. Martin's Press, Inc., 1956. xvi, 768 pp., illus.

Beach  Earl F. Beach, Economic Models: An Exposition. New York: John Wiley & Sons, Inc., 1957. xi, 227 pp., illus.

Hood  William C. Hood and Tjalling C. Koopmans (eds.), Studies in Econometric Method, Cowles Commission Monograph 14. New York: John Wiley & Sons, Inc., 1953. xix, 323 pp., illus.

Kendall  Maurice G. Kendall, The Advanced Theory of Statistics, vols. I, II. London: Charles Griffin & Co., Ltd., 1943; 5th ed., 1952. Vol. I, xii, 457 pp., illus.; vol. II, vii, 521 pp., illus.

Klein  Lawrence R. Klein, A Textbook of Econometrics. Evanston: Row, Peterson & Company, 1953. ix, 355 pp., illus.

Koopmans  Tjalling C. Koopmans (ed.), Statistical Inference in Dynamic Economic Models, Cowles Commission Monograph 10. New York: John Wiley & Sons, Inc., 1950. xiv, 438 pp., illus.

Tinbergen  Jan Tinbergen, Econometrics, translated from the Dutch by H. Rijken van Olst. New York: McGraw-Hill Book Company, Inc., Blakiston Division, 1951. xii, 258 pp., illus.

Tintner  Gerhard Tintner, Econometrics. New York: John Wiley & Sons, Inc., 1952. xiii, 370 pp.



The fundamental proposition 
of econometrics 

1.1. What econometrics is about 

An econometrician's job is to express economic theories in mathe- 
matical terms in order to verify them by statistical methods, and to 
measure the impact of one economic variable on another so as to be 
able to predict future events or advise what economic policy should be
followed when such and such a result is desired. 

This definition describes the major divisions of econometrics, namely, 
specification, estimation, verification, and prediction. 

Specification has to do with expressing an economic theory in mathe- 
matical terms. This activity is also called model building. A model 
is a set of mathematical relations (usually equations) expressing an 
economic theory. Successful model building requires an artist's touch, 
a sense of what to leave out if the set is to be kept manageable, elegant,
and useful with the raw materials (collected data) that are available. 
This book deals only incidentally with the "specification" aspect of econometrics.
The problem of estimation is to use shrewdly our all too scanty data, 



so as to fill the formal equations that make up the model with numerical 
values that are good and trustworthy. Suppose we have the following simple theory to quantify: Consumption today (C) depends on yesterday's income (Z) in such a way that equal increments of income, no matter what income level you start from, always bring equal increments in consumption. Letting α stand for consumption at zero income and γ for the marginal propensity to consume, this theory can be expressed

C_t = α + γZ_t   (1-1)

The problem of estimation (and the main concern of this book) is to discover how to use whatever experience we have about consumption C and income Z in order to make a shrewd guess about how large α and γ might really be. The problem of estimation is to guess correctly α and γ, the parameters (or inherent characteristics) of the consumption function.

Point estimation is making the best possible single guess about α and about γ. Interval estimation is guessing how far our guess of α may be from the true α, and our guess of γ from the true γ.
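The two kinds of estimate can be made concrete with a small numerical sketch. The data below are invented, and the least squares recipe (justified in Chapter 2) stands in here for whatever estimating method one finally adopts; the point estimates are single guesses at α and γ, and the standard error suggests an interval around the guess of γ:

```python
import numpy as np

# Hypothetical consumption C and lagged income Z (the numbers are made up).
Z = np.array([10.0, 12.0, 15.0, 11.0, 14.0, 16.0, 13.0, 17.0])
C = np.array([9.0, 10.5, 12.8, 9.8, 12.1, 13.5, 11.4, 14.2])

# Point estimation: a single best guess at each parameter.
# Least squares gives gamma = cov(Z, C) / var(Z), alpha = mean(C) - gamma * mean(Z).
gamma_hat = np.cov(Z, C, ddof=1)[0, 1] / np.var(Z, ddof=1)
alpha_hat = C.mean() - gamma_hat * Z.mean()

# Interval estimation: how far might the guess of gamma be from the true gamma?
resid = C - (alpha_hat + gamma_hat * Z)
n = len(Z)
s2 = resid @ resid / (n - 2)                        # estimated error variance
se_gamma = np.sqrt(s2 / ((n - 1) * np.var(Z, ddof=1)))

print(f"alpha_hat = {alpha_hat:.3f}, gamma_hat = {gamma_hat:.3f} (s.e. {se_gamma:.3f})")
```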

It is not enough, of course, to be able to make correct point and interval estimates. We want to make them as cheaply as possible. We shall therefore want efficient programming of computations, checks of accuracy, and shortcuts. Though this aspect of estimation will not occupy us very much, I shall give some computational advice from time to time.

Verification sets up criteria of success and uses these criteria to accept or reject the economic theory we are testing with the model and our data. It is a tricky subject deeply rooted in the mathematical theory of statistics.

Prediction involves rearranging the model into convenient shape, so 
that we can feed it information about new developments in exogenous 
and lagged variables and grind out answers about the impact of these 
variables on the endogenous variables. 

1.2. Mathematical tools 

In explaining how to fashion good estimates for the parameters of an econometric model I shall often step into the mathematical statistician's toolroom to bring out one gadget or another required by the next step of our procedure. These digressions are clearly marked, so they can be skipped by those acquainted with the tool in question.

The mathematical tools used again and again are elementary: analytic geometry, which makes equations and graphs interchangeable; probability, a concept enabling us to make precise statements about uncertain events; the derivative (or the operation of differentiating), which is a help in making a "best" guess among all possible guesses; moments, which are a sophisticated way of averaging various magnitudes; and matrices, which are nothing but many-dimensional ordinary numbers — indeed, statements that are true of ordinary numbers seldom fail for matrices — for instance, you can add, subtract, multiply, and divide matrices analogously to numbers and, in general, handle them as if they were ordinary numbers though perhaps more fragile; a vector is a special kind of matrix.
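The claim about matrices can be tried directly on a machine. The sketch below uses Python with NumPy (a modern convenience, obviously not in the original text); the two matrices are arbitrary examples:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [0.0, 3.0]])
B = np.array([[1.0, 4.0],
              [2.0, 5.0]])

S = A + B                   # addition, element by element
P = A @ B                   # matrix multiplication
Q = A @ np.linalg.inv(B)    # "division" by B: multiply by B's inverse

# The fragility the text warns of: unlike a nonzero number, a matrix may
# have no inverse (zero determinant), and multiplication need not commute.
print("A @ B == B @ A ?", np.allclose(A @ B, B @ A))
```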

1.3. Outline of procedure and main discoveries in the 
next hundred pages 

I. We shall deal first with models consisting of a single equation. 
We shall find that even in this simple case there are important difficulties:

A. It is not always possible to estimate the parameters of even a 
single-equation model, for two sorts of reason: 

1. We may lack enough data. This is called the problem of degrees of freedom.

2. Though the data are plentiful, they may not be rich or varied enough. This is the problem of multicollinearity.

B. Our second important finding will be that "pedestrian" 
methods of estimation, for example the least squares fit, are 
apt to be treacherous. They either give us erroneous impres- 
sions about the true values of the parameters or waste the data. 

II. Turning then to models containing two or more equations, our 

main findings will be the following: 

A. It is sometimes impossible to determine the value of each 
parameter in each equation, but this time not merely for lack 
of data or their monotony, but rather because the equations 
look too much like one another to be disentangled. Econ- 


ometricians call this undesirable property lack of identifiability. 

B. Nonpedestrian, statistically sophisticated methods become very 
complex and costly to compute when the model increases from 
a single equation even to two. 

C. Happily, however, by sacrificing some of the rigor of these ideal 
" equestrian " methods in special, shrewd ways, we can cut 
the burden of computation by a factor of 5 or 10 and still get 
pretty good results. Such techniques are called limited information techniques, because they deliberately disregard
refinements that should ideally be taken into account. Most 
theoretical econometricians work in this field, because the need 
is very great to know not only how to boil down complexity 
with clever tricks but also precisely how much each trick costs 
us in accuracy. 

1.4. All-importance of statistical assumptions

The key word in estimation is the word stochastic. Its opposite is exact or systematic.

Stochastic comes from the Greek stokhos (a target, or bull's-eye). The outcome of throwing darts is a stochastic process, that is to say,
fraught with occasional misses. In economics, indeed in all empirical 
disciplines, we do not expect our predictions to hit the bull's-eye 
100 per cent of the time. 

Econometrics begins by saying a great deal more about this matter 
of missing the mark. Where ordinary economic theory merely recog- 
nizes that we miss the mark now and then, econometrics makes 
statistical assumptions. These are precise statements about the par- 
ticular way the darts hit the target's rim or hit the wall. Everything — 
estimation, prediction, and verification — depends vitally on the content 
of the statistical assumptions. Econometric models emphasize this fact by using a special variable u called the error term. The error term varies from instance to instance, just as one dart falls above, another below, one to the left, another to the right, of the target. A subscript t serves to indicate the various values of the error term. To make model (1-1) stochastic, we write

C_t = α + γZ_t + u_t   (1-2)
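A simulation may help fix ideas. The values of α and γ and the normally distributed error below are illustrative choices of mine, not taken from the text; each period draws its own u_t, so observed consumption scatters around the systematic part of (1-2):

```python
import random

alpha, gamma = 2.0, 0.75        # hypothetical parameter values
random.seed(4)                  # fixed seed, so the run is reproducible

Z = [10 + 2 * t for t in range(6)]            # yesterday's income, by period
u = [random.gauss(0.0, 0.5) for _ in Z]       # one error draw per period

C = [alpha + gamma * z + e for z, e in zip(Z, u)]   # stochastic model (1-2)
C_exact = [alpha + gamma * z for z in Z]            # exact model (1-1)

for c, ce in zip(C, C_exact):
    print(f"observed C = {c:6.2f}   systematic part = {ce:6.2f}")
```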


Before going on to rationalize the presence of the error term u in equation (1-2), two things must be explained. First, u could have been included as a multiplicative factor or as an exponential rather than as an additive term. Second, its subscript t need not express time. It can refer just as well to various countries or income classes.

To facilitate the exposition, I shall henceforth treat u_t as an additive term and take t to represent time. Exceptions will be clearly labeled. The common-sense interpretation of additivity is deferred to Sec. 1.11.

1.5. Rationalization of the error term 

There are four types of reasons why an econometric model should be stochastic and not exact: incomplete theory, imperfect specification, aggregation of data, and errors of measurement. Not all of them apply to every model.

1. Incomplete theory 

A theory is necessarily incomplete, an abstraction that cannot explain everything. For instance, in our simple theory of consumption:

a. We have left out possible variables, like wealth and liquid assets, that also affect consumption.
b. We have left out equations. The economy is much more complex 
than a single equation, no matter how many explanatory variables this 
single equation may contain; there may be other links between con- 
sumption and income besides the consumption function. 1 

c. Human behavior is "ultimately" random. 

2. Imperfect specification 

We have linearized a possibly nonlinear relationship. 

3. Aggregation of data 

We have aggregated over dissimilar individuals. Even if each of them possessed his own α and γ and if his consumption reacted in exact (nonstochastic) fashion to his past income, total consumption would not be likely to react exactly in response to a given total income, because its distribution may change. Another way of putting this is:

¹ How many independent links there may be and how we are to find them is itself a problem in statistical inference, and is treated briefly in Chap. 10.


Variables expressing individual peculiarities are missing (cf. 1a). Or this way: Equations that describe income distribution are missing (cf. 1b).
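A two-consumer arithmetic example (with invented parameters) shows why aggregation alone forces an error term: each individual relation is exact, yet the same total income yields different total consumption when its distribution shifts:

```python
# Each consumer i has an exact function C_i = a_i + g_i * Z_i.
a = [1.0, 1.0]
g = [0.9, 0.5]          # different marginal propensities to consume

def total_consumption(incomes):
    return sum(ai + gi * zi for ai, gi, zi in zip(a, g, incomes))

# The same total income of 20, split two different ways:
c1 = total_consumption([15.0, 5.0])   # most income goes to the high-g consumer
c2 = total_consumption([5.0, 15.0])   # most income goes to the low-g consumer

# Identical total income, different total consumption: no exact aggregate
# function of total income alone can reproduce both outcomes.
print(c1, c2)
```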

4. Errors of measurement 

Even if behavior were exact, survey methods are not, and our 
statistical series for consumption and income contain some errors of 
measurement. Throughout this book we pretend that all variables are 
measured without error. 

1.6. The fundamental proposition 

All we get out of an econometric model is already implied in: 

1. Its specification; that is to say, consumption C depends on yesterday's income Z as in the equation C_t = α + γZ_t + u_t and in no other way¹

2. Our assumptions concerning u, that is to say, the particular way we suppose the relationship between C and Z to be inexact²

3. Our sampling procedure, namely, the way we arrange to get data 

4. Our sample, i.e., the particular data that happen to turn up after 
we decide how to look for them 

5. Our estimating criterion, i.e., what properties we desire our estimates of α and γ to have, short of the unattainable: absolute certainty

Over items 1, 2, 3, and 5 we have absolute control; for we are free to 
change our theory of consumption, our set of assumptions concerning 
the error term, our data-collecting techniques, and our estimating 
criterion. We have no control over item 4; for what data actually 
turn up is a matter of luck. 

According to this fundamental proposition, what estimates we get for the parameters (α and γ) depends, among other things, on the stochastic assumptions, i.e., what we choose to suppose about the behavior of the error term u. Every set of assumptions about the error term prescribes a certain way of guessing at the true value of the parameters. And conversely, every guess about the parameters is implicitly a set of stochastic assumptions.

¹ This is also called "structural specification."
² This is also called "stochastic specification."


The relationship between stochastic assumptions and parameter 
estimates is not always a one-to-one relationship. A given set of 
stochastic assumptions is compatible with several sets of different 
parameter estimates, and conversely. In practice, we don't have to 
worry about these possibilities, because we shall be making assumptions about u that lead to unique guesses about α and γ, or, at the very worst, to a few different guesses. Also, in practice, since we usually are interested in α and γ, not in verifying our assumptions about u, it does not matter that many different u assumptions are compatible with a single set of parameter estimates.

1.7. Population and sample

The whole of statistics rests on the distinction between population 
and observation or sample. People were receiving income and con- 
suming it long before econometrics and statistics were dreamed of. 
There is, so to speak, an underlying population of C's and Z's, which 
we can enumerate, hypothetically, as follows: 

C_1, C_2, . . . , C_p, . . . , C_P          C_p for p = 1, 2, . . . , P
Z_1, Z_2, . . . , Z_p, . . . , Z_P          Z_p for p = 1, 2, . . . , P

Of these C's and Z's we may have observed all or some. Those that we have observed take on a different index s instead of p, to emphasize that they form a subset of the population. All we observe, then, is C_s and Z_s, where the s assumes some (perhaps all) of the values that p runs, but no more. Index s can start running anywhere, say, at p = 5, assume the values 6 and 7, skip 8, 25, and 92, and stop anywhere short of the value P or at P itself. In all cases that I shall discuss, the sample covers consecutive time periods, which are renumbered, for convenience, in such a way that the beginning of time coincides with the beginning of the sample, not of the population. Whether the sample is consecutive or not sometimes does and sometimes does not affect the estimation of α and γ.

Note that we mean by the term sample a given collection of observations, like S = (C_9, C_10, C_11; Z_9, Z_10, Z_11), not an isolated observation. S is a sample of three observations, the following ones: (C_9, Z_9), (C_10, Z_10), and (C_11, Z_11). Samples made up of a single observation can exist, of course, but we seldom work with so little observation.


1.8. Parameters and estimates 

Another crucial distinction is between a parameter and its estimate. 
If the theory is correct, there are, hiding somewhere, the true α and γ. 
These we never observe. What we do do is guess at them, basing 
ourselves on such evidence and common sense as we may have. The 
guesses, or estimates, always wear a badge to distinguish them from 
the parameters themselves. 


We shall use three kinds of badge for a parameter estimate: a roof- 
shaped hat, as in α̂, γ̂, to mark maximum likelihood estimates; a bird, 
the same symbol upside down, as in α̌, γ̌, for naive least squares esti- 
mates; and the wiggle, as in α̃, γ̃, for other kinds of estimates or for 
estimates whose kind we do not wish to specify. These types of 
estimate are defined in Chap. 2. 

The distinction between error and residual is analogous to the 
distinction between parameter and estimate. The error u_t is never 
observed, although we may speculate about its behavior. It always 
goes with the real α and γ, as in (1-2), whereas the residual, which is an 
estimate of the error and whose symbol always wears a distinctive 
badge, can be calculated, provided we have settled on a particular 
guess (α̂, γ̂) or (α̃, γ̃) for the parameters. The value of the error does 
not depend on our guessing; it is just there, in the population and, 
therefore, in the sample. The residual, however, depends on the 
particular guess. To emphasize this fact we put the same badge on 
the residual as on the corresponding parameter estimate. We write, 
for example, C_t = α̂ + γ̂Z_t + û_t or C_t = α̃ + γ̃Z_t + ũ_t. 

Now we can state precisely what the problem of estimation is, as 
follows. We assume a theory, for example, the theory of consumption 
C_t = α + γZ_t + u_t; we assume that u_t behaves in some particular 
way (to which the next section is devoted); we get a set of observations 
on C and Z (the sample). Then we manipulate the sample data to 
give us estimates α̃ and γ̃ that satisfy our estimating criterion (dis- 
cussed in Chap. 2). Then we compute, if we wish, the residuals ũ_t as 
estimates of the errors u_t. 
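The computation just described can be sketched in a few lines; the sample figures and the two parameter guesses below are invented purely for illustration:

```python
# Hypothetical sample of consumption C and income Z, and a hypothetical
# parameter guess (alpha-tilde, gamma-tilde); none of these numbers
# comes from the text.
Z = [100.0, 110.0, 125.0, 140.0]
C = [92.0, 101.0, 112.0, 124.0]

alpha_guess, gamma_guess = 5.0, 0.85  # one particular guess

# Residuals are computable once a guess is settled on:
# u-tilde_t = C_t - (alpha-tilde + gamma-tilde * Z_t).
residuals = [c - (alpha_guess + gamma_guess * z) for c, z in zip(C, Z)]
print(residuals)

# A different guess gives different residuals; the true errors u_t,
# by contrast, sit in the population whether we guess or not.
other = [c - (6.0 + 0.84 * z) for c, z in zip(C, Z)]
print(other)
```

The point of the badge notation is visible here: `residuals` and `other` are both legitimate ũ series, one per guess.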


1.9. Assumptions about the error term 

Besides additivity (Sec. 1.4) we shall now make and interpret six 
assumptions about the error term. Of these, the first is indispensable, 
exactly as it stands. The other five could be different in content or in 
number. Note carefully that these are statements about the u's, not 
the C's. 

Assumption 1. u is a random real variable. 

If the model is stochastic, either its systematic variables (consump- 
tion or income) are measured with errors, or the consumption function 
itself is subject to random disturbances, or both. Since we have ruled 
out (Sec. 1.5) errors of measurement, the relationship itself has to be 
subject to random disturbance. 
" Random" is one of those words whose meaning everybody knows 
but few can define. Unpardonably, few standard texts glv© its 
definition. A variable is random if it takes on a number of different 
values, each with a certain probability. Its different values can be 
infinite in number and can range all over the field, provided th@r<§ are 
at least two of them. For instance, a variable w that is equal to 
— y% twenty-five per cent of the time, to 3 + \/2 forty per cent of the 
time, and to +35.3 thirty-five per cent of the time is a random variable. 
We may or may not know what values it takes on or their probabili- 
ties. Its probability distribution may or may not be expressible 
analytically. (See Sec. 2.2.) 
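The three-valued variable w just described can be simulated; the values and probabilities below are the ones from the text (with −½ as the first value, as reconstructed here), and the relative frequencies in a long run of draws should approach the stated probabilities:

```python
import math
import random

# The discrete random variable w from the text: -1/2 with probability
# 0.25, 3 + sqrt(2) with probability 0.40, +35.3 with probability 0.35.
values = [-0.5, 3 + math.sqrt(2), 35.3]
probs = [0.25, 0.40, 0.35]

random.seed(1)
draws = random.choices(values, weights=probs, k=100_000)

# Empirical relative frequencies should be close to the stated probabilities.
freq = [draws.count(v) / len(draws) for v in values]
print(freq)
```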

Digression on the distinction between probability 
and probability density 

A random variable w can be discrete, like the number of remain- 
ing teeth in members of a group of people, or continuous, like their 
weight. If w takes on a finite number of values, their probabili- 
ties can be quite simply plotted on and read off a dot diagram or 
point graph (Fig. 1a). 

With a continuous variable we can usually speak only of the 
probability that its value should lie between (or at) such and 
such limits. In this case, we plot a probability-density graph 
(Fig. 1b). The height of such a graph at a point is the probability 
density, and the relative area under the graph between any two 
points of the w axis is the probability. 
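The rule that area, not height, gives probability can be checked numerically. The sketch below uses the standard normal density (an assumption of convenience; the text's Fig. 1b does not specify a distribution) and compares the area under the density between two points with the exact probability from the error function:

```python
import math

def density(w):
    # Standard normal probability density: the HEIGHT of the graph at w.
    return math.exp(-w * w / 2) / math.sqrt(2 * math.pi)

def cdf(w):
    # Exact area to the left of w, via the error function.
    return 0.5 * (1 + math.erf(w / math.sqrt(2)))

a, b = -1.0, 2.0

# Probability = AREA under the density between a and b,
# approximated here by the trapezoid rule.
n = 10_000
h = (b - a) / n
area = sum(density(a + i * h) for i in range(1, n)) * h
area += (density(a) + density(b)) * h / 2

exact = cdf(b) - cdf(a)
print(area, exact)
```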

Assumption 2. u_t, for every t, has zero expected value. 

Naively interpreted, this proposition says that the "average" value 
of u_1 is zero, that of u_2 is also zero, and so forth. Or, to put it dif- 
ferently, it says that a prediction like C_1 = α + γZ_1 is "on the 
average" correct, that the same is true of C_2 = α + γZ_2, and so forth. 

Fig. 1. A random variable. a. Variable w is discrete. The illustration is a dot 
diagram, or point graph. b. Variable w is continuous. The illustration is a 
probability-density graph. 

The trouble begins when you begin to wonder what "on the average" 
can possibly mean if you stick to a single instance, like time period 1 
(or time period 2). For every event happens in a particular way and 
not otherwise. Suppose, for instance, that in the year 1776 (t = 1776) 
consumption fell short of its theoretical level α + γZ_1776 by 2.24 
million dollars, that is, that u_1776 = −2.24. Obviously, then, the 
average value of u_1776 is exactly −2.24. What could we possibly wish 
to convey by the statement that u_1776 (and every other u_t) has zero 
expected value? 

One should never identify the concept of expected value with the 
concept of the arithmetic mean. Arithmetic mean denotes the sum of 
a set of N numbers divided by N and is an algebraic concept. Expected 


value is a statistical or combinatorial concept. You have to imagine an 
urn containing a (finite or infinite) number of balls, each with a number 
written on it. Consider now all the possible ways one could draw one 
ball from such an urn. The arithmetic mean of the numbers that 
would turn up if we exhausted all possible ways of drawing one ball is 
the expected value. 
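The distinction is easy to make concrete with the three-valued variable w of Sec. 1.9: the expected value weights each value by its probability, that is, by how often it turns up among all conceivable draws from the urn, while the arithmetic mean of the three listed values weights each by 1/3. A minimal sketch:

```python
import math

# Values and probabilities of the discrete variable w (from the text).
values = [-0.5, 3 + math.sqrt(2), 35.3]
probs = [0.25, 0.40, 0.35]

# Expected value: each value weighted by its probability, i.e. the mean
# over all conceivable single draws from the urn.
expected = sum(v * p for v, p in zip(values, probs))

# Arithmetic mean of the three listed values: each weighted 1/3.
arithmetic = sum(values) / len(values)

print(expected, arithmetic)  # two different numbers
```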

The random term of an econometric model is assumed to come from 
an Urn of Nature which, at every moment of time, contains balls with 
numbers that add up to zero. 

The common-sense interpretation of Nature's Urn is as follows: 
Though in 1776 actual consumption in fact fell short of the theoretical 
by 2.24 and no other amount, the many causes that interacted to 
produce u_1776 = −2.24 could have interacted (in 1776) in various other 
ways. This, theoretically, they were free to do since they were 
random causes. Now, try to think of all conceivable combinations of 
these causes — or if you prefer, think of very many 1776 worlds, 
identical in all respects except in the combinations of random causes 
that generated the random term. Let us have as many such worlds 
as there are theoretical combinations of the causes behind the random 
term. In some worlds the causes act overwhelmingly to make con- 
sumption lower than the nonstochastic level α + γZ_1776; in other 
worlds the causes act so as to make it greater than α + γZ_1776; and in a 
few worlds the causes cancel out, so that C_1776 = α + γZ_1776 exactly. 
Now consider the random terms of all possible worlds, and (says the 
assumption) they will average out to zero. 

This interpretation is a conceptual model we can never hope to 
prove or disprove. Its chief merit is that it reduces chance and 
statistics to the (relatively) easy language and theorems of com- 
binatorial algebra. Some people take it seriously; others (myself 
included) use it for lack of anything better. 

Digression on the distinction between "population" 
and "universe" 
Whether or not we take Nature's Urn seriously, we will be well 
advised to acknowledge that we are dealing with three levels of 
discourse, not just the two that I called population and sample. 
The third and deeper level is called the universe. It contains all 


events as they have happened and as they might have happened if 
everything else had remained the same but the random shocks. 

Level I Sample: things that both happened and were observed. 
It is drawn from 
Level II Population : things that happened but were not neces- 
sarily observed. It is drawn from 
Level III Universe: all things that could have happened. (In 
the nature of things only a few did.) 


We shall henceforth use four types of running subscript: 

s = 1, 2, . . . , S    for the sample 

p = 1, 2, . . . , P    for the population 

i = 1, 2, . . . , I    for the universe 

t = 1, 2, . . . , T    for instances in general, whether 
                       they come from the sample, the 
                       population, or the universe 

In a sense the population (C_p, Z_p) of consumption and income 
as they actually happened in recorded and unrecorded history is 
merely a sample from the universe (C_i, Z_i) of all possible worlds. 
Naturally, what we call the sample is drawn from the population 
of actual events, not from the hypothetical universe of level III. 
In most instances it does no harm to speak (and prove theorems) 
as if level I were picked directly from level III, not from level II. 

The Platonic universe of level III is indeed rather unseemly for 
the field of statistics (which is surely, in lay opinion, the most 
hard-boiled of mathematics, resting solidly on "facts") and has 
been amusingly ridiculed by Hogben.¹ 

The next few paragraphs state why the abstract model of 
Nature's Urn is a less appropriate foundation for econometrics 
than for statistical physics or biology. But the rest of the book 
goes merrily on using the said Urn. 

Economic and physical phenomena alike take place in time. 

1 Lancelot Hogben, F. R. S., Statistical Theory: The Relationship of Probability, 
Credibility, and Error; An Examination of the Contemporary Crisis in Statistical 
Theory from a Behaviourist Viewpoint, pp. 98-105 (London: George Allen & Unwin 
Ltd., 1957). 


In both fields, the statement that u_t is a random variable for each 
t is inevitably an abstraction, because time runs on inexorably. 
In the physical sciences events are deemed "repeatable," or aris- 
ing from a universe "fixed in repeated samples," primarily 
because the experimenter can ideally replicate exactly all system- 
atically significant conditions that had surrounded his original 
event. This is not possible in social phenomena of the " irre- 
versible" or "progressive" type. Although in the physical 
sciences it may be safe to neglect the difference between popula- 
tion and universe, it is unsafe in econometrics. For, as economic 
phenomena take place in time, all other conditions, including 
the exogenous variables, move on to new levels, often never to 
return. The common-sense phrase " on the average over similar 
experiments" makes much more sense in a laboratory science 
than in economics. 

Nature's Urn also supports maximum likelihood, variance of an 
estimate, bias, consistency, and many other notions we shall have 
occasion to introduce in later chapters. All these rest on the 
notion of "all conceivable samples." The class of all conceivable 
samples includes first of all samples of all conceivable sizes; it also 
includes all conceivable samples of a given size, say, 4. A sample 
of size 4 may consist of points that actually happened (if so, they 
are in the population) ; it also could consist (partly or entirely) 
of points in the universe but not in the population. The latter 
kind of sample is easy to conceive but impossible to draw, because 
the imagined points never "happened." Therefore, even a com- 
plete census of what happened is not enough for constructing an 
exhaustive list of all conceivable samples. 

Assumption 3. The variance of u_t is constant over time. 

This means merely that, in each year, u t is drawn from an identical 
Urn, or universe. This assumption states that the causes underlying 
the random term remain unchanged in number, relative importance, 
and absolute impact, although, in any particular year, one or another 
of them may fail to operate. 

For simplicity's sake we have assumed no errors of measurement. 
In fact there may be some, and their typical size could vary system- 
atically with time (or with the independent variable Z). If we try 


to measure the diameter of a distant star, our error of measurement is 
likely to be several million miles; when we measure the diameter of 
Sputnik, it can be only a few feet. Likewise, if our data stretch from 
1850 to 1950, national income increases by a factor of 20. It is quite 
likely that errors of measurement, too, should increase absolutely. 
If they do, Assumption 3 is violated, and some of the techniques that 
I develop below should not be used. 
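A sketch of that violation, with invented numbers: income Z grows twentyfold over the sample, the measurement error's scale grows in proportion to Z, and the error variance in the late half of the sample is then several times the variance in the early half, contradicting Assumption 3:

```python
import random

random.seed(2)

# Hypothetical sketch: income Z grows twentyfold over the sample period,
# and the error's scale grows in proportion to Z, which is precisely the
# situation that violates Assumption 3.
T = 2_000
Z = [100 * (1 + 19 * t / (T - 1)) for t in range(T)]  # from 100 up to 2000
u = [random.gauss(0, 0.05 * z) for z in Z]            # error scale ~ 5% of Z

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Compare the error variance in the early and the late half of the sample.
early = variance(u[: T // 2])
late = variance(u[T // 2 :])
print(early, late)  # the late-half variance is far larger
```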

Digression on infinite variances 

The variance of u is not only constant but finite. When u is 
normal it is unnecessary to stipulate that its variance is finite, 
because all nondegenerate normal distributions have finite vari- 
ance. There exist, however, nondegenerate distributions with 
zero mean and infinite variance, for example, the discrete random 
variable that takes the values 

. . . , −16, −8, −4, 4, 8, 16, . . . 

with probabilities, respectively, 

. . . , 1/16, 1/8, 1/4, 1/4, 1/8, 1/16, . . . 

The central limit theorem, according to which the sum of N 
random variables (distributed in any form) approaches the 
normal distribution for large N, is valid only if the original distri- 
butions have finite variances. 
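The distribution above (as reconstructed here) can be sampled: |w| = 2^k with probability 2^(1−k) for k = 2, 3, . . . , with a random sign. The sketch below draws from it; the mean is zero by symmetry, but the average of w² is dominated by a few huge draws and does not settle down as the sample grows:

```python
import random

random.seed(3)

def draw():
    # One draw from the distribution in the text (as reconstructed here):
    # |w| = 2**k with probability 2**(1 - k) for k = 2, 3, ...; the sign
    # is + or - with equal probability.  E[w**2] = sum of 2**k diverges.
    k = 2
    while random.random() < 0.5:
        k += 1
    return random.choice([-1, 1]) * 2 ** k

sample = [draw() for _ in range(200_000)]

def mean_square(xs):
    return sum(x * x for x in xs) / len(xs)

# The "sample variance" never settles down; compare a short and a long run.
print(mean_square(sample[:1000]), mean_square(sample))
```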

Assumption 4. The error term is normally distributed. 

This is a rather strong restriction. We impose it mainly because 
normal distributions are easy to work with. 

Digression on the univariate normal distribution 

The single- variable normal distribution is shaped like a sym- 
metrical two-dimensional bell whose mouth is wider than the 
mouth of anything you might name. Normal distributions come 
tall, medium, and squat (i.e., with small, medium, or large vari- 
ances). And the top of the bell can be over any point of the 


w axis; that is to say, the mean of the normal can be negative, 
zero, or positive, large or small. This distribution's chief charac- 
teristic is that extreme values are more and more unlikely the 
more extreme they get. 

For instance, the likelihood that all the people christened 
John and living in London will die today is extremely small, and 
the likelihood that none of them will die today is equally small. 
Now why is this? Because London is not under atomic attack, 
the Johns are not all aboard a single bus, not all of them are diving 
from the London Bridge, nor were they all born 85 years ago. 
Each goes about his business more or less independently of the 
others (except, perhaps, father-and-son teams of Johns), some old, 
some young, some exposing themselves to danger and others not. 
The reason why the probability that w of these Johns will die 
today approximates the normal is that there are very many of 
them and that each is subjected to a vast number of independent 
influences, like age, food, heredity, job, and so forth. This 
probability would not be normal if the Johns were really few, 
if the causes working toward their deaths were few, or if such 
causes were many but linked with one another. 

The assumption that u is normal is justified if we can show that the 
variables left out of equation (1-2) are infinitely numerous and not 
interlinked. If they are merely very many and not interlinked, then 
u is approximately normal. If they are infinitely many but enough 
of them are interlinked, then u is not even approximately normal. 
We often know or suspect that these variables, such as wealth, liquid 
balances, age, residence, and so forth, are quite interlinked and are 
very likely to be present together or absent together. 
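The many-small-independent-causes story can be imitated numerically. In the sketch below (all numbers invented), u is the sum of 50 independent uniform influences, and the share of draws within one standard deviation of the mean comes out near the normal value of 68.3 per cent:

```python
import math
import random

random.seed(4)

# Hypothetical sketch: the error u is the sum of 50 independent,
# non-interlinked omitted influences, each uniform on (-1, 1).
def one_error(n_causes=50):
    return sum(random.uniform(-1, 1) for _ in range(n_causes))

errors = [one_error() for _ in range(20_000)]

mean = sum(errors) / len(errors)
sd = math.sqrt(sum((e - mean) ** 2 for e in errors) / len(errors))

# For a normal variable, about 68.3 per cent of draws fall within one
# standard deviation of the mean.
share = sum(1 for e in errors if abs(e - mean) <= sd) / len(errors)
print(share)
```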

Sometimes the following argument is advanced: In our model of 
consumption, the error term u stems from many sources; that is, we 
have left out variables, we have left out equations, we have linearized, 
we have aggregated, and so on. These are all different operations, 
presumably not linked with one another. Therefore, u is normally 
distributed. 
This argument is, of course, a bad heuristic argument, and it does 
not even stand for an existing (but difficult) rigorous argument. It 
is logically untidy to count as arguments for the normality of u, on one 
and the same level of discourse, such diverse items as the fact of 
linearization and the number of unspecified variables that affect u. 

The assumption stands or falls on the argument of many non- 
interlinked absent variables. Most alternative assumptions cause 
great computational grief. 

Assumption 5. The random terms of different time periods 
are independent. 

This assumption requires that in each period the causes that deter- 
mine the random term act independently of their behavior in all 
previous and subsequent periods. It is easy to violate this assumption. 

1. The error term includes variables that act cyclically. If, for 
example, we think consumption has a bulge every 3 years because that 
is how often we get substantially remodeled cars, this effect should be 
introduced as a separate variable and not included in u. 

2. The model is subject to cobweb phenomena. Suppose that 
consumers in year 1 (for any reason) underestimate their income, so 
that they consume less than the theoretical amount. Then in year 2 
they discover the error and make it up by consuming more than the 
theoretical amount of year 2; and so on. 

3. One of the causes behind the random term may be an employee's 
vacation, which is usually in force for 2 weeks though the model's 
unit period is 1 week. Any such behavior violates the requirement 
that the error of any period be independent of the error in all previous 
periods. 
Assumption 6. The error is not correlated with any 
predetermined variable. 

To appreciate this assumption, suppose that (for whatever reason) 
sellers set today's price p t on the basis of the change in the quantity 
sold yesterday over the day before; that is, 

p_t = α + γ(q_{t−1} − q_{t−2}) + u_t 
Suppose, further, that the greater (and more evident) the change in q 


the more they strive to set a price according to the above rule. Such 
behavior violates Assumption 6. 

We can think of examples where behavior is fairly exact (small u's) 
for moderate values of the independent variable, but quite erratic for 
very large or very small values of the independent variable. I am 
apt, for instance, to stumble more if there is either too little or too 
much light in the room. This, again, violates Assumption 6, because 
the error in the stochastic equation describing my motion depends on 
the intensity of light. 

So we come to the end of our statistical assumptions about the error 
u. When in future discussion I speak of u as having "all the Simplify- 
ing Properties" (or as "satisfying all the Simplifying Assumptions"), 
I mean exactly these six. 

Certain of these six assumptions can be checked or statistically 
verified from a sample; others cannot. I shall return to this topic 
later. 
Of these assumptions only Assumption 1 is obligatory. There are 
decent estimating procedures for other sets of assumptions. 

1.10. Mathematical restatement of the Six Simplifying 
Assumptions 

1. u_t is random for every t: Some p(u) is defined for all u, such 
that 

0 ≤ p(u) ≤ 1    and    ∫ p(u) du = 1 

2. The expected value of u_t is zero: 

εu_t = 0    for all t 

3. The variance σ_uu(t) is constant in time, and finite: 

0 < σ_uu(t) = cov (u_t, u_t) = σ_uu < ∞    for all t 

4. u_t is normal: 

p(u) = (2π)^(−1/2) det (σ_uu)^(−1/2) exp [−½(u − εu)(σ_uu)^(−1)(u − εu)] 

I explain this fancy notation in the next chapter. I use it because it 


generalizes very handily into many dimensions. The usual way to 
write the normal distribution is 

p(u) = [1 / (σ_u √(2π))] exp [−(u − εu)² / (2σ_uu)] 

The symbol σ_uu is the square of σ_u; σ_uu is the variance of u. 

5. u is not autocorrelated: 

ε(u_t · u_{t−θ}) = 0    for all t and for θ ≠ 0 

6. u is fully independent of the variable Z: 

cov (u_t, Z_{t−θ}) = 0    for all t and all θ 
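On a simulated error series that satisfies the six statements by construction (all numbers below are invented), the checkable ones can be verified empirically: the sample mean is near zero, the variance near its nominal value, and the correlations of Assumptions 5 and 6 near zero. A sketch:

```python
import random

random.seed(6)

T = 10_000
Z = [random.uniform(50, 150) for _ in range(T)]  # a predetermined variable
u = [random.gauss(0, 2.0) for _ in range(T)]     # errors with the Six Properties

mean_u = sum(u) / T                               # should be near 0
var_u = sum((x - mean_u) ** 2 for x in u) / T     # should be near 2.0**2 = 4

def corr(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    vx = sum((a - mx) ** 2 for a in x) / len(x)
    vy = sum((b - my) ** 2 for b in y) / len(y)
    return cov / (vx * vy) ** 0.5

auto = corr(u[:-1], u[1:])   # Assumption 5: no autocorrelation
with_z = corr(u, Z)          # Assumption 6: no correlation with Z

print(mean_u, var_u, auto, with_z)
```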

1.11. Interpretation of additivity 

The random term u appears in model (1-2) as an additive term. 
This fact rules out interaction effects between u and Z. Absence of 
interaction effects means that, no matter what the level of income Z 
may be, a random term of a given magnitude always has the same 
effect on consumption. Its impact does not depend on the level of 
income. 
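A two-line computation makes the point. With invented parameter values, an additive disturbance of size u shifts C by exactly u at any income level, whereas a non-additive (here multiplicative) disturbance of the same size has a larger effect the larger Z is:

```python
# Hypothetical parameter values, for illustration only.
alpha, gamma = 10.0, 0.75
u = 2.0  # a disturbance of given magnitude

def additive_impact(Z):
    # Model C = alpha + gamma*Z + u: the disturbance shifts C by u itself.
    return (alpha + gamma * Z + u) - (alpha + gamma * Z)

def interacting_impact(Z):
    # A non-additive alternative, C = (alpha + gamma*Z) * (1 + u/100):
    # the same u now has a larger effect the larger Z is.
    return (alpha + gamma * Z) * (1 + u / 100) - (alpha + gamma * Z)

print(additive_impact(100.0), additive_impact(1000.0))        # same impact
print(interacting_impact(100.0), interacting_impact(1000.0))  # grows with Z
```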

1.12. Recapitulation 

We must be very clear in econometrics, as well as in other areas of 
statistical inference, about what is assumed, what is observed, and 
what is guessed, and also about what criterion the guess should satisfy. 
Table 1.1 provides a check list of things we accept by assumption, 
things we can and cannot do, and things we must do in making 
statistical estimates. The items in the first three columns have been 
introduced in this chapter; the estimating criteria in the fourth column 
will be discussed in the next chapter. 

Digression on the differences among moment, 
expectation, and covariance 

Consider two variables, consumption c and yesterday's income 
z. They may or may not be functionally related. They have a 
universe U = (c_1, . . . , c_i, . . . , c_I; z_1, . . . , z_i, . . . , z_I). 


Table 1.1 

These things are assumed: 
  That a true α and a true γ exist 
  That a true u_t exists in each time period 
  That u_t has the Six Simplifying Properties 
  That there is a universe C_i, Z_i (i = 1, 2, . . . , I) in which 
    C_i = α + γZ_i + u_i 

These things are never observed: 
  The true α and γ 
  The true u_t 
  εu, the expected value, and σ_uu, the variance of the u's 
  The C's and Z's not in the sample 

These things are observed: 
  The C's and Z's of the sample, denoted C_s, Z_s (s = 1, 2, . . . , S) 

This is chosen by us: 
  Some estimating criterion for computing α̃, γ̃, ũ 

These things are computed by us: 
  α̃, a guess as to α, and γ̃, a guess as to γ 
  The residuals ũ_s (s = 1, 2, . . . , S) 
  (Σũ_s)/S, the mean of the residuals ũ, and m_ũũ, the moment of 
    the residuals ũ 


The average value of c in the universe is symbolized by εc (read 
"expected value of c" or "expectation of c"). Similarly, εz is 
the average z in the universe. 

The covariance of c and z is defined as the expected value 

ε(c − εc)(z − εz) 

where i runs over the entire universe. This is symbolized by 
cov (c,z) or σ_cz. 


The variance of c is simply the covariance of c and c. It is 
written var c, or cov (c,c), or σ_cc. 


Now consider a specific sample S° made up of specific (corre- 
sponding) pairs of consumption and income, for instance, 

(c_27, c_54, c_105; z_27, z_54, z_105) 

Let the sample means for this particular sample be written c̄° 
and z̄°, respectively. 


The moment (for sample S°) of c on z is defined as the sample 
average of 

(c_s − c̄°)(z_s − z̄°) 

where s runs over 27, 54, and 105 only. It is symbolized by 
m_cz(S°) or simply m_cz. Of course, a different sample S¹ 
would give a different moment m_cz(S¹). 

Expectation of a moment 

Now consider all samples of size 3 that we can draw (with 
replacement) from the universe U. Then the expectation of m C9 
is the average of the various moments m e . M (S°), m c .,(S 1 ), etc., 
when all conceivable samples of size 3 are taken into account. 

A universe with J elements generates f ) such samples, and the 

means c and z of the two variables vary from sample to sample. 

The expectation of m ct for samples of size 4 is, in general, a 
different value altogether. 
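For a toy universe the averaging over all conceivable samples can be carried out exactly rather than imagined. The sketch below (invented numbers) enumerates every ordered size-3 sample drawn with replacement and shows that the expectation of m_cz is not σ_cz itself but (2/3)σ_cz, that is, ((n−1)/n)σ_cz for n = 3, which is why the expectation for samples of size 4 is a different value:

```python
from itertools import product

# A tiny made-up universe of (c, z) pairs; purely illustrative numbers.
universe = [(2.0, 1.0), (4.0, 3.0), (6.0, 2.0), (8.0, 6.0)]

def moment(sample):
    # Sample moment m_cz: average product of deviations from SAMPLE means.
    n = len(sample)
    cbar = sum(c for c, _ in sample) / n
    zbar = sum(z for _, z in sample) / n
    return sum((c - cbar) * (z - zbar) for c, z in sample) / n

# Universe covariance sigma_cz: deviations from UNIVERSE means.
sigma_cz = moment(universe)

# Expectation of m_cz over ALL ordered samples of size 3 drawn with
# replacement: enumerate every one of the 4**3 conceivable samples.
samples = list(product(universe, repeat=3))
expected_m = sum(moment(s) for s in samples) / len(samples)

print(expected_m, sigma_cz)  # expected_m equals (2/3) * sigma_cz
```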

Much confusion will be avoided later on if these distinctions 
are kept in mind. Clear up any questions by doing the exercises 
that follow. 

1.A If c, c′, c″, c‴ are four independent drawings (with replace- 
ment) from the universe U, prove that ε(c′ + c″ + c‴) = 3εc. 

1.B If c, z, q are variables and k is a constant, which of the follow- 
ing relations are identities? 


cov (c,z) = cov (z,c) 

m_cz = m_zc 

ε(c + z) = εc + εz 

cov (kc,z) = k cov (c,z) 

var (kz) = k² var z 

m_(kc)z = k·m_cz 

cov (c + q, z) = cov (c,z) + cov (q,z) 
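A numerical check (not a proof) of several of the 1.B identities, on arbitrary invented data:

```python
import random

random.seed(7)

# Arbitrary made-up data for c, z, q and a constant k.
n = 500
c = [random.uniform(0, 10) for _ in range(n)]
z = [random.uniform(0, 10) for _ in range(n)]
q = [random.uniform(0, 10) for _ in range(n)]
k = 3.7

def cov(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)

# cov(c,z) = cov(z,c)
assert abs(cov(c, z) - cov(z, c)) < 1e-9
# cov(kc,z) = k cov(c,z)
assert abs(cov([k * a for a in c], z) - k * cov(c, z)) < 1e-9
# var(kz) = k**2 var(z), writing var as cov of a variable with itself
kz = [k * a for a in z]
assert abs(cov(kz, kz) - k ** 2 * cov(z, z)) < 1e-9
# cov(c+q,z) = cov(c,z) + cov(q,z)
cq = [a + b for a, b in zip(c, q)]
assert abs(cov(cq, z) - (cov(c, z) + cov(q, z))) < 1e-9

print("all identities check numerically")
```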

Further readings 

The art of model specification is learned by practice and by studying 
cleverly contrived models. Beach,¹ Klein, Tinbergen, and Tintner give 
several examples. L. R. Klein and A. S. Goldberger, An Econometric Model 
of the United States: 1929-1952 (Amsterdam: North-Holland Publishing 
Company, 1957), present a celebrated large-scale econometric model. Chap- 
ters 1 and 2 give a good idea of the difficulties of estimation. The performance 
of this model is appraised by Karl A. Fox in "Econometric Models of the 
United States" (Journal of Political Economy, vol. 44, no. 2, pp. 128-143, 
April, 1956) and by Carl Christ in "Aggregate Econometric Models" (American 
Economic Review, vol. 46, no. 3, pp. 385-408, June, 1956). 

"The Dynamics of the Onion Market," by Daniel B. Suit! and Susumu 
Koizumi (Journal of Farm Economics, vol. 38, no. 2, pp. 475-484* May, 1956)^ 
is an interesting example of econometrics applied to a particular market In 
the short run. 

Kendall, chap. 7, reviews the logic of probability, sampling, and expected 
value. For a lucid discussion of the concept of randomness, see M. G. 
Kendall, "A Theory of Randomness" (Biometrika, vol. 32, pt. 1, pp. 1-15, 
January, 1941). 

As far as I know, the assumptions about the random term have not been 
discussed systematically from the economic point of view, except for Mar- 
schak's brief passage (pp. 12-15) in Hood, chap. 1, sec. 7. See also Gerhard 
Tintner, The Variate Difference Method (Cowles Commission Monograph 5, 
pp. 4-5 and appendixes VI, VII, Bloomington, Indiana: Principia Press, 
1940), and Tintner, "A Note on Economic Aspects of the Theory of Errors 
in Time Series" (Quarterly Journal of Economics, vol. 53, no. 1, pp. 141-149, 
November, 1938). 

As defined in economics textbooks, the production function and the cost 
function necessarily violate Assumption 2. In no instance (whether in the 
universe, the population, or the sample) can the random disturbance exceed 
zero in the production function and fall short of zero in the average cost 

1 See Frequent References at front of book. Works of authors whose names 
are capitalized are listed there. 


function. All statistical studies of production and cost functions I know of 
have implicitly used the assumption that εu = 0. The error is in the assump- 
tion of normality. See, for instance, Joel Dean, "Department Store Cost 
Functions," in Studies in Mathematical Economics and Econometrics, in memory 
of Henry Schultz, edited by Oscar Lange, Francis McIntyre, and Theodore O. 
Yntema (p. 222, Chicago: University of Chicago Press, 1942), which is also an 
interesting attempt to fit static cost functions to data from years of large 
dynamic changes. In this respect I was guilty myself in "An Econometric 
Model of Growth: U.S.A. 1869-1953" (American Economic Review, vol. 45, 
no. 2, pp. 208-221, May, 1955). 

For examples of nonadditive disturbances, see Hurwicz, "Systems with 
Nonadditive Disturbances," chap. 18 of Koopmans, pp. 410-418. 


Estimating criteria and the method 
of least squares 

2.1. Outline of the chapter 

This chapter, like the previous one, deals exclusively with single- 
equation models. Unless the contrary is stated, all the Simplifying 
Assumptions of Sec. 1.9 remain in force. The main points of this 
chapter are the following: 

1. Once we have specified the model and made certain stochastic 
assumptions, our sample tells us nothing about the unknown parame- 
ters of the model unless we adopt an estimating criterion. 

2. A very reasonable (and hard to replace) criterion is maximum 
likelihood. It is based on the assumption that, while we were taking 
the sample, Nature performed for our benefit the most likely thing, 
or generated for us her most probable sample. 

3. Once the maximum likelihood criterion is adopted, we can tell 
precisely what the unknown parameters must be if our sample was 
the most likely to turn up. This is what is called maximizing the 
likelihood function. We find the unknowns by manipulating this 
function. 


4. The familiar least squares fit arises as a special case of the oper- 
ation of maximizing the likelihood function. 

5. In many cases, adopting the maximum likelihood criterion auto- 
matically generates estimates that conform to other estimating 
criteria, for example unbiasedness, consistency, efficiency. 

6. If estimates of the unknown parameters are unbiased, consistent, 
etc., this does not mean that our particular sample or method has 
given us a correct estimate. It means that, if we had infinite facilities 
(or infinite patience), we could get a correct estimate "in the long run" 
or "on the average." 

7. The likelihood function not only tells us what values of the 
parameters give the greatest probability to the observed event but 
also attaches to such values degrees of credence, or reliability. 

Though these statements can be made about all sorts of models, 
the single-equation model of consumption that I have been using all 
along captures the spirit of the procedure. Multi-equation models 
have all the complications of single-equation models plus many others. 

2.2. Probability and likelihood 

In common speech, probability and likelihood are but Latin and 
Saxon doublets. In statistics the two terms, though often inter- 
changed for the sake of variety or style, have distinct meanings. 
Probability is a property of the sample; likelihood is a property of the 
unknown parameter values. 


Imagine that, in a model that described Nature's workings perfectly, 
the true values of the parameters a, /?, 7, . . . were such and such 
and that the true stochastic properties of the error term u were such 
and such. We would then say that certain types of natural behavior 
(i.e., certain samples or observations) were more probable than others. 
For example, if you knew that a river flowed gently southward at a 
speed of 3 miles per hour, that an engineless boat drifting on it had 
such and such dimensions, weight, and friction (the model); if, in 
addition, you knew that gentle breezes usually blow in the area, very 


rarely faster than 5 miles per hour, and that they usually blow now 
in one, now in another direction (the stochastic properties) ; then you 
would be very much surprised to find an instance in which the boat 
had traveled 25 miles northward or 30 miles southward in the space 
of 2 hours (the improbable behavior). 


Now reverse the position. If you were sure of your information 
about the wind, if you did not know which way or how fast the river 
flowed, but you observed the boat 28 miles south of where it was 
2 hours ago and were willing to assume that Nature took the most 
probable action while you happened to be observing her, then you 
would infer that the river must have a southward current of 14 miles 
per hour. This is the maximum likelihood estimate, or most likely 
(NOT most probable) speed of the river on the evidence of this 
unfortunate sample. Any other southward speed and any kind of 
northward flow are highly unlikely, or less likely than 14 miles per 
hour south. 

To say that any other speed is less probable is to misuse the term. 
The river's speed is what it is (3 miles per hour to the south) and it 
cannot be more or less probable. What can be more or less probable 
is the particular observation: that the boat has traveled southward 
28 miles. This observation is very improbable if the river indeed 
flows 3 miles per hour southward. It would be more probable if the 
river flowed southward with a speed of 5, 7, or 10 miles. And it 
would be most probable if the true speed of the river had been 14 miles 
per hour. Evidently, a maximum likelihood guess can be very far 
from the truth. 

All estimation in econometrics operates as in this river example, no 
matter how elaborate the model, sloppy or exquisite the sample. 

What is so commendable about the maximum likelihood criterion, 
if it cannot guarantee us correct or even nearly correct results? Why 
assume that Nature will do the most likely thing? All I can say to 
this is to ask: Well, what shall we assume instead? the second most 
likely thing? the seventy-first? 

It is true that (in some cases) maximum likelihood estimates tend
to be correct estimates "on the average" or "in the long run" (see
Secs. 2.7 and 2.11). These facts, however, are irrelevant, because we


use the maximum likelihood criterion even when we plan neither to 
repeat the experiment nor to enlarge the sample. 

It is very important to appreciate just what maximum likelihood
estimation does: The experimenter makes one observation, say, that
the boat had traveled 28 miles southward in 2 hours; he then asserts
hopefully (he does not know this) that the wind has been calm, because
this is the most typical total net wind speed for all conceivable 2-hour
stretches; and so he lets his estimate of the speed be 14 miles.

Actually, we (who happen to know that the true speed is 3 miles) 
realize that, while the experimenter was busy measuring, the weather 
was not at all typical but happened to be the improbable case of 
2 hours of strong southerly wind. 

The same experimenter under different circumstances might estimate
the speed to be 6, 0.5, −2, −3.0, etc., miles per hour, depending
on the wind's actual whim during the 2-hour interval in which observation
took place.


2.A Set up an econometric model of the river-and-boat example
of Sec. 2.2, using the following symbols: d_t for the number of miles
(from a fixed point) traveled southward in t hours by the boat, γ for the
(unknown) speed of the river in miles per hour, and u_t for the net
southbound component of the wind's speed in miles per hour. Let u_t
have the following stochastic specification:

10 per cent of the time u_t = 11 (southbound)

70 per cent of the time u_t = 0 (calm)

10 per cent of the time u_t = −5 (northbound)

10 per cent of the time u_t = −6 (northbound)

Construct a probability table giving the net wind effects for 2 hours in
succession. For each type of conceivable observation, derive the
maximum likelihood estimate of γ.
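In modern terms, the probability table asked for in Exercise 2.A can be built by enumerating all 4 × 4 wind combinations. The sketch below (Python, with all names my own) tabulates the 2-hour net wind and applies the reasoning of Sec. 2.2: the likeliest wind total is zero, so the maximum likelihood estimate of γ is half the observed 2-hour displacement.

```python
from itertools import product

# Stochastic specification of the net southbound wind u_t (miles per hour),
# as given in Exercise 2.A
wind = {11: 0.10, 0: 0.70, -5: 0.10, -6: 0.10}

# Probability table of the net wind effect over 2 hours in succession
two_hour = {}
for (u1, p1), (u2, p2) in product(wind.items(), repeat=2):
    two_hour[u1 + u2] = two_hour.get(u1 + u2, 0.0) + p1 * p2

# For an observed 2-hour displacement d, the likeliest wind total is 0
# (probability 0.7 * 0.7 = 0.49), so the ML estimate of gamma is d/2.
def gamma_hat(d):
    return d / 2

# The boat of the text: 28 miles south in 2 hours gives gamma_hat = 14,
# even though this observation requires the improbable wind total 22.
estimate = gamma_hat(28)
```

Note that the improbable wind total 22 (two hours of u = 11) carries probability only 0.01, which is why the estimate of 14 can be so far from the true speed of 3.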

Digression on the multivariate normal distribution 

The univariate normal distribution for a variable u with
universe mean εu and variance σ_uu was written in Sec. 1.10 in the
fancy form

p(u) = (2π)^(−1/2) det (σ_uu)^(−1/2) exp [−½(u − εu)(σ_uu)^(−1)(u − εu)]   (2-1)

because of the ease with which it generalizes to the multivariate case.

Let u_1, u_2, . . . , u_N be N variables which have a joint normal
distribution. Define

u = vec (u_1,u_2, . . . ,u_N)

εu = vec (εu_1,εu_2, . . . ,εu_N)

σ_uu = [ σ_{u_1u_1}  . . .  σ_{u_1u_N} ]
       [     .                  .      ]
       [ σ_{u_Nu_1}  . . .  σ_{u_Nu_N} ]

For σ_{u_1u_2} we often write cov (u_1,u_2), or simply σ_12 if the meaning
is clear from the context. Sometimes the inverse of σ_uu, usually
written (σ_uu)^(−1), is written σ^{uu}, and its elements are written σ^{u_mu_n}
or just σ^{mn}. These superscripts are not exponents. If we need
to write an exponent we write it outside parentheses, as in equation
(2-1).

To get the multivariate distribution for u_1, u_2, . . . , u_N, all
we need to do is change the italic u's of (2-1) into bold characters:

p(u) = (2π)^(−N/2) det (σ_uu)^(−1/2) exp [−½(u − εu)(σ_uu)^(−1)(u − εu)]   (2-2)


This illustrates the principle noted in Sec. 1.2: that if an oper- 
ation, theorem, property, etc., holds for simple numbers, it holds 
analogously for matrices. This is a great convenience, because 
you can pretend that matrices are numbers and so collapse a 
complicated formula into a shorter and more intuitive expression. 
Moreover, by pretending a matrix is a number, you can get a 
clear impression of what a formula conveys. 
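The principle can be checked numerically: when σ_uu is diagonal, the matrix formula (2-2) must factor into a product of univariate densities of the form (2-1). A sketch in Python with NumPy (the numerical values are arbitrary illustrations, not from the text):

```python
import numpy as np

def mvn_density(u, mean, cov):
    """Multivariate normal density, formula (2-2)."""
    n = len(u)
    dev = u - mean
    quad = dev @ np.linalg.inv(cov) @ dev
    return (2 * np.pi) ** (-n / 2) * np.linalg.det(cov) ** (-0.5) * np.exp(-0.5 * quad)

def uvn_density(u, mean, var):
    """Univariate normal density, formula (2-1)."""
    return (2 * np.pi * var) ** (-0.5) * np.exp(-0.5 * (u - mean) ** 2 / var)

# Arbitrary test point, means, and (diagonal) variances
mean = np.array([0.0, 1.0, -2.0])
var = np.array([1.0, 4.0, 0.25])
u = np.array([0.5, 0.0, -1.5])

# With a diagonal covariance matrix the joint density equals the
# product of the marginal univariate densities.
joint = mvn_density(u, mean, np.diag(var))
marginal_product = np.prod([uvn_density(u[i], mean[i], var[i]) for i in range(3)])
```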


2.B Write explicitly the joint normal distribution of the two
variables x and w.
2.C In Exercise 2.B, modify the formula for σ_xx = σ_ww and σ_xw = 0.


2.D Write in vector and matrix notation the formula

−½ Σ (from m = 1 to N) Σ (from n = 1 to N) (u_m − εu_m) σ^{mn} (u_n − εu_n)

2.3. The concept of likelihood function 

Consider again the river-and-boat illustration of the previous sec- 
tion. Our information can come in one of several different ways. 

1. A sample of one observation

Someone may have sighted the boat at the zero point at twelve
o'clock and 28 miles south of that 2 hours later. This is one observation;
it leads to 28/2 = 14 miles per hour southward as the maximum
likelihood estimate of the river's speed. The number of hours
elapsed from the beginning to the end of the observation could have
been 1 or ½ or 7 or anything else.

2. A sample of several independent observations 

We may have several observations like the above but made on
different days. For instance,

Observation    Time elapsed    Distance traveled
                2 hours        28 miles south
                4 hours        12 miles south
               17 hours        44 miles south

3. Several interdependent observations 

Observations may overlap; as, for example,

Observation    Time of observation             Distance traveled
a              12 to 2 p.m.                    28 miles south
b              1 to 5 p.m. of the same day     20 miles south

Or information may come in even more complicated ways. The likeli- 
hood function can be constructed only if we know both the circum- 
stances of our observations and the readings derived from them. 
Cases 1, 2, and 3 lead to different likelihood functions because the 


circumstances differ. Two observers, each of whom watched the boat
for 2 consecutive hours unbeknownst to the other, would set up two
likelihood functions identical in form into which they would feed
different readings. But each investigator would set up one and only
one likelihood function. This is a function of a single sample: the
sample, his sample; no matter how independent, complicated, or
interdependent his observations may be, they form a single sample.
The maximum likelihood criterion tells us to proceed as if Nature 
did the most probable thing. We assert this about the totality of 
observations in the sample rather than about any single observation. 

2.4. The form of the likelihood function 

Return to the consumption model C_t = α + γZ_t + u_t. The following
statement must be accepted on faith (its proof is a deepish
theorem in analysis): Under the assumptions that Nature conforms
to the model and that the true values of the parameters are α and γ,
the probability of observing the particular sample C_1, C_2, . . . , C_S,
Z_1, Z_2, . . . , Z_S is equal to the probability that the error term shall
have assumed the particular values u_1, u_2, . . . , u_S multiplied by a
factor det J.

The term det J happens to be equal to 1 in all single-equation
cases; so we need not worry about it yet. It becomes important in
two-or-more-equation models.

The statement cited above is of immense and curious significance.
We observe the sample C_t, Z_t. But we cannot know directly how
probable or improbable it is to obtain this particular sample, since
all our stochastic assumptions have to do with the probability distribution
of the u's, not of the C's and Z's. On the other hand, we can
never observe the random errors themselves. So one might despair
of finding the probability of this particular sample but for the remarkable
property cited. Let L stand for the probability of the sample
and q for the probability that the random term will take on the values
u_1, u_2, . . . , u_S. Then we have

L = det J · q(u_1,u_2, . . . ,u_S)   (2-3)

Now, the (unobservable) u's are functions of the (observed) C's
and Z's and of (the unknown) α and γ, because the model implies
u_t = C_t − α − γZ_t. To maximize likelihood is to seek the pair of
values of α and γ that makes L as large as possible.

What form q(u_1, . . . ,u_S) takes depends on the stochastic assumptions
about the error term.

This concludes the discussion of the logic behind maximum likelihood
estimating of α and γ.

In the next few pages I discuss the mechanism of maximizing L under
the Six Simplifying Assumptions. On a first reading, you might skip
the rest of this section without serious loss. Readers who wish to
refresh their manipulative skills, read on! We shall now omit writing
det J = 1, since we are discussing only single-equation cases at this
stage.
By Simplifying Assumption 4, the random terms u_1, u_2, . . . , u_S
come from a multivariate normal distribution. Therefore (2-2)
applies, and

L = q(u_1,u_2, . . . ,u_S) = (2π)^(−S/2) det (σ_uu)^(−1/2) exp [−½(u − εu)(σ_uu)^(−1)(u − εu)]   (2-4)

By Simplifying Assumption 2, εu_t = 0. Simplifying Assumption 3
states that all diagonal elements of σ_uu are equal to a finite constant
σ_uu, and Assumption 5 states that all nondiagonal elements are zero;
hence

det (σ_uu)^(−1/2) = (σ_uu)^(−S/2)

Therefore, (2-4) reduces to

L = (2π)^(−S/2)(σ_uu)^(−S/2) exp [−½ Σ_s (σ_uu)^(−1) u_s²]

and finally to

L = (2π)^(−S/2)(σ_uu)^(−S/2) exp [−½(σ_uu)^(−1) Σ_s u_s²]   (2-5)


The following properties of L will not be proved: 

1. L is a continuous function with respect to α, γ, σ_uu except at
σ_uu = 0. This means that it can be differentiated quite safely. As
for the exception, we need not worry about it; for u, as a random
variable, assumes at least two distinct values in the universe, and
therefore σ_uu > 0. If the sample is of only two observations, the fit
is perfect and m_uu is zero; but in that case we do not use the likelihood
approach at all. We just solve the two equations C_1 = α + γZ_1 and
C_2 = α + γZ_2 for the two unknowns α and γ.

2. Setting the partial derivatives of L equal to zero locates its 
maxima. It has no minimum; therefore, we do not need to worry 
about second-order conditions of maximization. 

3. L is a maximum when its logarithm is a maximum. So, instead
of (2-5), we maximize the more convenient expression

log L = −(S/2) log 2π − (S/2) log σ_uu − ½(σ_uu)^(−1) Σ_s u_s²   (2-6)

4. The true values of α, γ, and σ_uu are not functions of one another,
but constants. Therefore, in maximizing, all partial derivatives of α,
γ, and σ with respect to one another are zero.

Maximizing (2-6) results in

Σ_s (C_s − α − γZ_s) = 0
Σ_s (C_s − α − γZ_s)Z_s = 0                    (2-7)
(1/S) Σ_s (C_s − α − γZ_s)² = σ_uu

The solution of (2-7) for α, γ, σ_uu gives the maximum likelihood
estimates α̂, γ̂, σ̂.
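System (2-7) can be solved explicitly: the first two equations give α̂ and γ̂ in closed form, and the third gives σ̂ as the average squared residual. A minimal sketch; the C and Z values below are hypothetical, invented only to exercise the formulas:

```python
import numpy as np

# Hypothetical sample (not the book's data)
Z = np.array([2.0, 5.0, 7.0, 10.0, 12.0])
C = np.array([5.1, 6.0, 6.9, 8.2, 8.8])
S = len(Z)

# Solve the first two equations of (2-7) for gamma-hat and alpha-hat
gamma = (S * (C * Z).sum() - C.sum() * Z.sum()) / (S * (Z * Z).sum() - Z.sum() ** 2)
alpha = (C.sum() - gamma * Z.sum()) / S

# Third equation of (2-7): sigma-hat is the average squared residual
resid = C - alpha - gamma * Z
sigma_uu = (resid ** 2).sum() / S
```

The two assertions one would check are exactly the first two equations of (2-7): the residuals sum to zero, and they are orthogonal to Z.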

2.5. Justification of the least squares technique 

It is evident that (2-6) gives least squares estimates for α, γ, σ.

System (2-7) says that the maximum likelihood values of α and γ
are the values that minimize the sum of the squares of the residuals û_s.
The last equation in (2-7) states that the maximum likelihood estimate
σ̂_uu of the true variance σ_uu is the average square residual.

This, then, is the justification for minimizing squares. Remember 
that to get this result we had to make use of a great many assumptions 
both about the model itself and about the nature of its error term. 
If any one of these many assumptions had not been granted, we 
might not have reached this result. Therefore, one should not go 


about minimizing squares too lightheartedly. For every different set
of assumptions a certain estimating procedure is best, and least
squares is best only with a proper combination of assumptions. Conversely,
every estimating procedure contains in itself (implicitly) some
assumptions either about the model, or about the distribution of u,
or both.¹

Digression on computational arrangement 


It pays to develop a tidy scheme for computing α̂, γ̂, and σ̂,
because computation recipes similar to (2-7) turn up pretty often.

It is always possible to arrange the computations in such a
way as to estimate the coefficient γ of the independent variable
first. With γ̂ in hand, one computes the constant term α̂.
Finally, with α̂ and γ̂, one computes the residuals û_s, and from
these residuals, an estimate of σ_uu.

An analogous procedure for models having several independent
variables (and, hence, several γs) is developed in the next
section.

In all cases, that is to say, for simple as well as for complicated
models, I shall describe only the computational steps for estimating
the γs (coefficients of the independent variables).

Write (2-7) as follows:

αS + γΣZ_s = ΣC_s
αΣZ_s + γΣZ_s² = ΣC_sZ_s

where the sums run over the entire sample. Now subtract ΣZ
times the first equation from S times the second. The result is

γ[SΣZ² − (ΣZ)(ΣZ)] = [SΣCZ − (ΣC)(ΣZ)]   (2-8)

Note that we have eliminated α and that, moreover, in the
square brackets we may recognize the familiar moments, defined in
Chap. 1. Thus (2-8) is equivalent to

1 The Six Simplifying Assumptions are sufficient but not necessary conditions 
for least squares. Least squares is a "best linear unbiased estimator" under 
much simpler conditions. This, however, is another subject. I chose these 
particular six assumptions because with them it is easy to show how a stochastic 
specification and an estimating criterion lead to a specific estimate of a parameter 
rather than to some other estimate. 


γ̂ m_ZZ = m_ZC

and the estimate of γ can be expressed very simply as

γ̂ = (m_ZZ)^(−1) m_ZC   or   m_ZC/m_ZZ   (2-9)

which, besides being compact, generalizes easily to N dimensions,
i.e., by replacing the Greek and italic letters by the corresponding
characters in boldface type:

γ̂ = (m_zz)^(−1) m_zc   (2-9a)
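The stepwise arrangement of the digression, γ̂ from the moments first by (2-9), then the constant term, then the residuals, can be sketched as follows (the data are again hypothetical):

```python
import numpy as np

# Hypothetical sample (not the book's data)
Z = np.array([2.0, 5.0, 7.0, 10.0, 12.0])
C = np.array([5.1, 6.0, 6.9, 8.2, 8.8])

# Moments about the means, as defined in Chap. 1
m_ZZ = ((Z - Z.mean()) ** 2).mean()
m_ZC = ((Z - Z.mean()) * (C - C.mean())).mean()

gamma = m_ZC / m_ZZ                   # formula (2-9)
alpha = C.mean() - gamma * Z.mean()   # constant term, computed second
resid = C - alpha - gamma * Z         # residuals, computed last
sigma_uu = (resid ** 2).mean()        # estimate of the error variance
```

As a check, the γ̂ obtained this way agrees with the slope of an ordinary least squares fit.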


2.6. Generalized least squares 

All the principles discussed so far apply to all linear models consisting 
of a single equation. To treat the general case, we shall make a slight 
change in notation: y will stand for any endogenous variable (the role 
played by consumption C so far) and z for any exogenous variable 
(the role played by lagged income Z so far). 

Let us suppose that the endogenous variable y(t) depends on H
different predetermined variables z_1(t), z_2(t), . . . , z_H(t) as follows
(omitting time indexes):

y = α + γ_1z_1 + γ_2z_2 + · · · + γ_Hz_H + u   (2-10)

Indeed, the analogy of (2-10) with C = α + γZ + u is so perfect
that everything said about the latter applies to the shorthand edition
of (2-10):

y = α + γz + u

where γ is the vector (γ_1,γ_2, . . . ,γ_H) and z is the vector¹ (z_1,z_2,
. . . ,z_H).

But we must be careful. The first five Simplifying Assumptions
(u is random, has mean zero, has constant variance, is normal, and is
serially independent) need no alteration. Assumption 6 must, however,
be changed to read as follows: The error term u_t is fully independent
of z_1(t), z_2(t), . . . , z_H(t).

Under the new version of the Simplifying Assumptions, the maximum
likelihood criterion leads to the estimates of γ_1, γ_2, . . . , γ_H
that minimize the sum of the squared residuals. And, moreover,
these estimates are given by the formula

γ̂ = (m_zz)^(−1) m_zy   (2-11)

which is exactly analogous to (2-9). What these boldface symbols
mean is explained in the next digression.

¹ For typographical simplicity I shall not bother, in obvious cases, to distinguish
a column from a row vector. In this case z is a column vector.
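Formula (2-11) can be tried out numerically for a hypothetical model with H = 2 predetermined variables; everything below, including the true parameter values, is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical model: y = 4 + 0.5*z1 - 1.2*z2 + u
S = 200
z = rng.uniform(0, 10, size=(S, 2))
u = rng.normal(0, 0.3, size=S)
y = 4.0 + z @ np.array([0.5, -1.2]) + u

# Moment matrices about the means (the boldface m_zz and m_zy)
zc = z - z.mean(axis=0)
yc = y - y.mean()
m_zz = zc.T @ zc / S
m_zy = zc.T @ yc / S

gamma = np.linalg.inv(m_zz) @ m_zy          # formula (2-11)
alpha = y.mean() - z.mean(axis=0) @ gamma   # constant term from the means
```

The estimates agree with an ordinary least squares fit on the uncentered data, since centering merely eliminates the constant term, exactly as subtracting equations did in deriving (2-8).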

Digression on matrices of moments and their determinants 

This is a natural place to introduce some extremely convenient 
notation, which we shall be using from Chap. 6 on. 

If p, q, r, x, y are variables, m_{(p,q,r)·(x,y)} is the matrix whose
elements are moments that can be constructed with p, q, r on x
and y. The variables in the first parentheses correspond to the
rows and those in the second to the columns. Thus,

m_{(p,q,r)·(x,y)} = [ m_px  m_py ]
                    [ m_qx  m_qy ]
                    [ m_rx  m_ry ]

The middle dot in the subscript may be omitted.

Likewise, m_zz means the matrix whose elements are moments of
the variables z_1, z_2, . . . , z_H on themselves:

m_zz = [ m_{z_1z_1}  . . .  m_{z_1z_H} ]
       [     .                  .      ]
       [ m_{z_Hz_1}  . . .  m_{z_Hz_H} ]

and m_zy means

m_zy = [ m_{z_1y} ]
       [    .     ]
       [ m_{z_Hy} ]

Every square matrix has a determinant. So does every square
matrix of moments, for instance, m_zz; for the determinant of m
we write det m, perhaps with the appropriate subscripts det m_zz,
or det m_{(z_1, . . . ,z_H)(z_1, . . . ,z_H)}.

But it is simpler to write m_zz instead of det m or det m_zz;
and we shall do this for compactness. The lightface italic m in
the expression m_zz indicates that the determinant is a simple
number, like 2 or 16.17, and neither a vector nor a matrix of
numbers (these are printed bold).

One way to estimate the coefficients γ_1, γ_2, . . . , γ_H is to
perform the matrix operations given in (2-11). Another way
is by Cramer's rule, which calculates various determinants and
takes their ratios:

γ̂_1 = m_{(y,z_2, . . . ,z_H)(z_1,z_2, . . . ,z_H)} / m_{(z_1,z_2, . . . ,z_H)(z_1,z_2, . . . ,z_H)}

γ̂_2 = m_{(z_1,y,z_3, . . . ,z_H)(z_1,z_2, . . . ,z_H)} / m_{(z_1,z_2, . . . ,z_H)(z_1,z_2, . . . ,z_H)}

. . . . . . . . . . . . . . . . . . . .

γ̂_H = m_{(z_1, . . . ,z_{H−1},y)(z_1,z_2, . . . ,z_H)} / m_{(z_1,z_2, . . . ,z_H)(z_1,z_2, . . . ,z_H)}

Both these ways are very cumbersome in practice for equations
with more than three or four variables, unless we have ready
programs on electronic computers. Appendix B gives a stepwise
technique for calculating γ̂_1, γ̂_2, . . . , γ̂_H that can be used on
an ordinary desk calculator.
Matrix inversion is discussed in Appendix A.

2.7. The meaning of unbiasedness

Let us discuss bias and unbiasedness by using the original model of
consumption C_t = α + γZ_t + u_t with the understanding that all
conclusions hold true for the generalized single-equation model
y = α + γ_1z_1 + · · · + γ_Hz_H + u. Furthermore, we can restrict ourselves,
with some exceptions, to the discussion of γ, because the
statements to be made are also true of α.


Imagine that we obtain our guess γ̂ of the parameter γ, violating
none of the Simplifying Assumptions. The guess so chosen is the
most likely in the circumstances. But this does not guarantee it to
be equal to the true value γ. This is so because the observations
C_t, Z_t we have to go on are just a sample. And in sampling anything
can happen. Extremely atypical misleading samples are improbable
but perfectly possible. So it makes sense to ask how far off the guess
γ̂ is likely to be from the true value γ.

Here it is very important to distinguish between (1) taking again
and again a sample of size S, and (2) taking bigger and bigger samples
(one of each size). The first procedure is connected with the important
statistical notion of bias, the second with that of consistency. Both
procedures are ideal and impractical, because such samples must be
taken from the universe (level III) and not merely from the population
(level II). Therefore, even with infinite resources and infinite
patience, the concepts are not operational.

Consider any estimating recipe (say, least squares). Choose a
sample size, say, S = 20; draw (from the universe) all possible samples
of size 20; for each sample compute (by least squares) the corresponding
γ̂; then average out these γ̂s. If the average γ̂ equals the true γ,
then we say that the procedure of least squares is an unbiased method
for estimating γ, or an unbiased estimator of γ, for sample size S = 20.
Loosely, we might say that on the average, least squares gives a correct
estimate of γ from samples of 20 observations.
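This averaging experiment can be carried out exhaustively for a toy universe. The sketch below invents a small universe of its own (it is not the one of Table 2.2): three time periods, each with a mean-zero set of equally probable disturbances. Every conceivable sample of size 2 is fitted by least squares, and the estimates average out to the true α = 4 and γ = 0.4:

```python
from itertools import combinations, product

ALPHA, GAMMA = 4.0, 0.4

# Hypothetical universe: each time period has its own Z and a mean-zero
# set of equally probable disturbance values u.
periods = [
    (5.0, [-1.0, 1.0]),
    (10.0, [-2.0, 2.0]),
    (20.0, [0.0]),
]

slopes, intercepts = [], []
for (Z1, us1), (Z2, us2) in combinations(periods, 2):
    for u1, u2 in product(us1, us2):
        C1 = ALPHA + GAMMA * Z1 + u1
        C2 = ALPHA + GAMMA * Z2 + u2
        g = (C2 - C1) / (Z2 - Z1)   # two-point least squares slope
        slopes.append(g)
        intercepts.append(C1 - g * Z1)

mean_gamma = sum(slopes) / len(slopes)
mean_alpha = sum(intercepts) / len(intercepts)
```

Individual samples can be badly off, but because each period's disturbances average to zero, the enumeration over the whole universe recovers the true parameters exactly.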

In order to pin down firmly the concept of bias, I have constructed 
a purposely simple and exaggerated example. It involves just three 
time periods, very uneven disturbances from time period to time 
period, and a random disturbance that assumes just three different 
values. Yet this example illustrates all that could be shown with a 
larger and more realistic one. 

Assume that the true values of the parameters we seek to estimate
are α = 4, γ = 0.4. Assume that the population consists of exactly
3 elements, labeled a, b, c, whose coordinates are given in Table 2.1
below; the three points are shown in Fig. 2. This population could
have come from an infinite universe, but let us (for pedagogic reasons)
deal with a finite universe that consists of the above three points
a, b, c plus four more, which are named a′, b′, c′, and c″. Every point
of the universe is completely defined when we specify the random


Fig. 2. A seven-point universe. Solid dots: points in the population. Hollow
dots: points in the universe but not in the population. The figure also shows the
true (exact) relation C_t = 4 + 0.4Z_t.

Table 2.1
The population (a,b,c)

    Z_p    C_p    u_p

error u that corresponds to it and the level of the independent variable.
These are given in Table 2.2.

Table 2.2

Points in the population         Points in the universe but not in the population
Value of u_t    Value of Z_t     Value of u_t    Value of Z_t

2.E Which Simplifying Assumptions are fulfilled by u t in the uni- 
verse of Table 2.2, and which are violated? 



Now let us see if the least squares method is an unbiased estimator
of γ. First let us take all conceivable samples of size 2 and for each
compute the least squares value γ̂. Samples should be taken in such a
way that the same time period is not represented more than once.

The population can yield only the following pairs: (a,b), (a,c), and
(b,c). This is the most that a flesh-and-blood statistician, even one
equipped with unlimited means, could obtain operationally, because
points a′, b′, c′, and c″ exist, so to speak, only in the mind of God.
But the definition of bias requires us to check samples (of size 2) that
include all points of the universe, human and divine alike. There are
sixteen such samples, and the corresponding estimates α̂ and γ̂ are
given in Table 2.3 and plotted in Fig. 3. When all sixteen are considered,
it is seen that least squares is an unbiased estimator of γ
(and of α).

Table 2.3
Estimates of α and γ from samples of size 2

Points in the sample      γ̂           α̂
a  b                      0.4875
a  b′
a  c                     −0.2464
a  c′
a  c″
a′ b
a′ b′
a′ c
a′ c′
a′ c″
b  c                     −0.5400     8.1600
b  c′
b  c″
b′ c
b′ c′
b′ c″

Average of all conceivable samples      εγ̂ = 0.4000      εα̂ = 4.0000
Average of all feasible samples (those containing no primed points)



If we try all samples of size 3, we get the results tabulated in Table 2.4 
and plotted in Fig. 4. 



Table 2.4
Estimates of α and γ from samples of size 3

Points in the sample      γ̂           α̂
a  b  c                  −0.30288
a  b  c′
a  b  c″
a  b′ c
a  b′ c′
a  b′ c″
a′ b  c
a′ b  c′
a′ b  c″
a′ b′ c
a′ b′ c′                  0.73558
a′ b′ c″

Average of all conceivable samples      εγ̂ = 0.4000      εα̂ = 4.0000

For a sample of size 3, the least squares method is an unbiased estimator
of both γ and α.

In certain cases, not illustrated by our simple example, (1) an
estimating technique (say, least squares) may be unbiased for some
sample sizes and biased for other sizes; (2) a method may overestimate
γ for certain sample sizes and underestimate it for others, on the
average; (3) we may be able to tell a priori, knowing the sample size S,
whether the bias is positive or negative (in other cases we cannot);
(4) a method may be unbiased for one parameter but biased for another.

2.8. Variance of the estimate 

In Fig. 3 I have plotted all the estimates of a and 7 for -all posalbl© 
samples of size 2. The same thing was done for size B in Fig. 4* In 
general, the estimates are scattered or clustered, depending (I) on the 
size S of the sample, (2) on the size and other features of the universe, 
(3) on the particular estimating technique we have adopted, and 
ultimately (4) on the extent to which random effects dominate the 
systematic variables. Other things being equal, we prefer an esti- 
mating technique that yields clustered estimates. The spread among 
the various estimates $ is called the variance of the estimate •?, and is 


written <rtf or <r(1,1), or, sometimes, <r(W\S) if we want to emphasize 
what size sample it relates to. 
The variance is defined by 

and is a constant, which exists and can bo computed if the four items 
listed above arc known. Table 2.5 gives the values of <r for our 

















Fig. 3. Parameter estimates from all
samples of size 2. φ: double point.







Fig. 4. Parameter estimates from all
samples of size 3. φ: double point. The solid dot marks the true values
(α,γ) = (4, 0.4).

seven-point example. Note the interesting (and counterintuitive)
fact that the variance of the estimate can increase as the sample size
increases! This quirk arises because, in the example, the random
disturbance has a skew distribution. If u is symmetrical, the variance
of the estimate decreases as the sample size increases.

Table 2.5

Size S of samples      σ(γ̂,γ̂)

2.9. Estimates of the variance of the estimate 

If we have complete knowledge, we can compute the true value of
σ(γ̂,γ̂|S) by making a complete list of all samples of size S, computing
all possible estimates of γ, and finding their variance, as I did in the
above example. In practice, however, it is impossible to exhaust all
samples of a given size, because the universe contains points that are
not in the population. So, instead, we must be content with guessing
at the variance of the estimate by the use of whatever information is
contained in the single sample we have already drawn.

At first, you might suppose that estimating σ(γ̂,γ̂|S) is logically
impossible when you have a single sample of size S to work with,
because, after all, the variance of the estimate of γ represents what
happens to γ̂ as you take all samples of size S.

All is not lost, however, because a single sample of size S contains
several samples (S of them) each of size S-minus-1. The latter we can
generate by leaving out, one at a time, each observation of the original
sample. Thus, if the original sample is (a,b,c) of size S = 3, it contains
three subsamples of size 2 each, the following ones: (a,b), (a,c),
and (b,c), which yield, respectively, the three estimates γ̂(a,b), γ̂(a,c),
and γ̂(b,c). We get, then, some idea about variations in the estimate
of γ among samples of size 2. Still, we know nothing about the variance
of γ̂ as estimated from samples of size 3. Here we invoke the maximum
likelihood criterion. The original sample (a,b,c) was assumed to be
the most probable of its kind, namely, the family of samples containing
three observations each. If this is so, then observations a, b, c
generate the most probable triplet T = {(a,b),(a,c),(b,c)} of samples
containing two observations each. Therefore, the variability of γ̂
(in the triplet T) estimates its variability in samples of size 3.

From Table 2.3,

γ̂(a,b) = 0.4875
γ̂(a,c) = −0.2464
γ̂(b,c) = −0.5400
Average = −0.0997

The variance of γ̂ in the sample triplet is equal to

⅓[(0.4875 + 0.0997)² + (−0.2464 + 0.0997)²
            + (−0.5400 + 0.0997)²] = 0.1867



The last figure must now be corrected by the factor S-minus-1 = 2 if it
is to be an unbiased estimate of the variance of γ̂(a,b,c). Estimates of
variance based on averages, if uncorrected, naturally understate the
variance. The proof that

σ̂ = 0.1867 × 2 = 0.3734

is an unbiased estimate of σ(γ̂,γ̂|3) is in Appendix C.

In practice we are too lazy to estimate γ again and again for all
the subsamples. The formula σ̂(γ̂,γ̂|S) = m_uu/(S − 1)m_ZZ gives a
short-cut (and biased) estimate of the variance of γ̂ for samples of the
original size. Table 2.6 lists these estimates for three-point samples
and repeats some of the information from Table 2.4.
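The arithmetic of this section can be replayed directly from the three subsample estimates quoted above:

```python
# Subsample estimates of gamma from the triplet T = {(a,b),(a,c),(b,c)},
# as given in the text
estimates = [0.4875, -0.2464, -0.5400]
S = 3  # size of the original sample (a,b,c)

mean = sum(estimates) / len(estimates)
variance = sum((g - mean) ** 2 for g in estimates) / len(estimates)

# Correction by the factor S-minus-1 = 2 gives the unbiased estimate
# of the variance of gamma-hat(a,b,c)
corrected = (S - 1) * variance
```

This reproduces the figures of the text: the triplet variance is 0.1867, and the corrected estimate is 0.3734.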

Table 2.6
Estimates of γ and of the variance of its estimates

Points in the sample      γ̂            σ̂(γ̂,γ̂) = m_uu/(S − 1)m_ZZ
a  b  c                  −0.30288      0.02605
a  b  c′
a  b  c″
a  b′ c
a  b′ c′
a  b′ c″
a′ b  c
a′ b  c′
a′ b  c″
a′ b′ c
a′ b′ c′                  0.73558      0.00256
a′ b′ c″
Table 2.6 must be interpreted carefully. To begin with, the investigator
will usually know only its first line, because he has a single
sample to work with. The remaining lines are put in Table 2.6, for
pedagogic reasons, by the omniscient being who can consider all
possible worlds. Events could have followed one or another course
(and only one) among the courses listed in the several lines of Table 2.6.
It just happened that (a,b,c) materialized and not some other triplet.
It yielded the two estimates γ̂ = −0.30288, a very wrong estimate, and
σ̂ = 0.02605. The latter misleads us to believe in the accuracy of the
former.

If sample (a′,b′,c′) had materialized, the two guesses would have
been γ̂ = 0.73558 (not so bad as before) and σ̂ = 0.00256, which is
ten times as "confident" as before. It is entirely possible for a
sample to give a very wrong parameter estimate with a great deal of
confidence. The mere fact that σ̂(γ̂,γ̂) is small does not make γ̂ a
good guess.

It is comforting, of course, to have some measure of how much γ̂
varies from sample to sample. What is upsetting is that the measure
is itself a guess. True, it is better than nothing, but this is no consolation
if by some quirk of fate we have picked a sample so atypical
that it gives us not only a really wrong parameter estimate γ̂, but also
a really small σ̂(γ̂,γ̂). The moral is: Don't be cocksure about the
excellence of your guess of γ just because you have guessed that its
variance σ̂(γ̂,γ̂) is small.

2.10. Estimates ad nauseam 

Note carefully now that, whereas σ(γ̂,γ̂) is a constant, σ̂(γ̂,γ̂) is not,
but varies with each sample of the given size. Therefore σ̂(γ̂,γ̂) itself
has a variance, which we may denote by σ(σ̂(γ̂,γ̂)); this is a true
constant. Now there is nothing to prevent us from making a guess at
the latter on the basis of our sample, and this guess would be symbolized
by σ̂(σ̂(γ̂,γ̂)), which is no longer a constant but varies with each sample,
and so has a true variance σ(σ̂(σ̂(γ̂,γ̂))), and so on, ad infinitum. In
other words, we cannot get away from the fact that, if all we can do
about γ is to guess that it equals γ̂, then all we can do about its variance
σ(γ̂,γ̂) is to guess it too; likewise all we can do about this last guess is to
guess again about its true variance, and so on forever. Guess we
must, stage after stage, unless we have some outside knowledge.
Only with outside knowledge can the guessing game stop. The game
is rarely played, however, beyond σ̂(γ̂,γ̂), (1) because it is quite tedious,
and (2) because large enough samples give good γ̂s and σ̂s with high
probability.

2.11. The meaning of consistency 

As in our explanation of unbiasedness, let us discuss the parameter
γ of the model C_t = α + γZ_t + u_t with the understanding that all
conclusions generalize to all the parameters in the model

y = α + γ_1z_1 + · · · + γ_Hz_H + u

Consider any estimating recipe, say, least squares or least cubes.
Choose a sample of a given size, say, S = 20, and compute γ̂. Then
choose another sample containing one more observation (S = 21)
and compute its γ̂. Keep doing this, always increasing the sample's
size. The bigger samples do not have to include any elements of the
smaller samples, though this becomes inevitable as the big samples
grow, if the universe is finite.¹ If, as the size of the sample grows, the
estimates γ̂ improve, then we say that the least squares procedure is a
consistent estimator of γ. Note that γ̂ does not have to improve in
each and every step of this process of increasing the size of the sample.

Improvement in the above paragraph means that the probability
distributions of γ̂(S), γ̂(S + 1), . . . become more and more pinched
as they straddle the true value of the parameter.

Digression on notation 

There are two variant notations for consistency. Let γ̂(s) be
the consistent estimator from a sample of s observations. Let ε
and η be two positive numbers, however small. Then there is
some size S for which

P(|γ̂(s) − γ| < ε) > 1 − η

if s > S. A shorthand notation for the same thing is

P lim γ̂(s) = γ

which is to be read "γ̂(s) converges to γ in probability," or
"the probability limit of γ̂(s) is γ."
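The "pinching" of the distribution of γ̂(s) can be exhibited by simulation. The sketch below (model, numbers, and seed all invented for illustration) fits least squares to samples of growing size and measures the fraction of estimates falling within ε of the true γ:

```python
import numpy as np

rng = np.random.default_rng(42)
ALPHA, GAMMA, EPS = 4.0, 0.4, 0.05

def coverage(s, reps=300):
    """Fraction of samples of size s for which |gamma_hat - gamma| < EPS."""
    hits = 0
    for _ in range(reps):
        Z = rng.uniform(0, 10, s)
        C = ALPHA + GAMMA * Z + rng.normal(0, 1, s)
        gamma_hat = np.polyfit(Z, C, 1)[0]   # least squares slope
        hits += abs(gamma_hat - GAMMA) < EPS
    return hits / reps

# As s grows, P(|gamma_hat(s) - gamma| < EPS) climbs toward 1.
small, large = coverage(10), coverage(200)
```

Any individual large sample can still give a worse estimate than a small one, exactly as the text goes on to warn; consistency is a statement about the whole distribution, not about one drawing.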

Under very weak restrictions, a maximum likelihood estimate is also 
a consistent estimate. Note, however, that, even when the method is 
consistent, there is no guarantee that the estimate will improve every 
time we take a larger sample. It may turn out that our sample of 
size 2 happens to contain points a and b, which give an estimate 
f(2) = 0.4875, and the next larger sample happens to contain points 

1 A sample could, of course, bo infinite without ever including all the elements 
of an infinite universe. 


a, b, and c, which give an estimate γ̂(3) = −0.30288, which is much 
worse. Even when the larger sample includes all the points of the 
smaller, as in the example just cited, it can give a worse estimate. 
This is so because the next point drawn, c, may be so atypical as to 
outweigh the previous typical points a and b. 
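The sampling experiment just described can be put in numbers. The sketch below is my own illustration, not the book's: the values of α, γ, and the disturbance scale are invented. It draws least squares slope estimates from samples of growing size; any one sequence of estimates may worsen from one step to the next, yet the spread of the estimator shrinks as S grows, which is what consistency promises.

```python
# A sketch (my own, invented parameters) of the consistency experiment:
# least squares gamma-hat from C = alpha + gamma*Z + u, for growing S.
import random

random.seed(1)
ALPHA, GAMMA = 5.0, 0.5          # true parameters (illustrative values)

def draw_sample(size):
    zs = [random.uniform(0, 50) for _ in range(size)]
    cs = [ALPHA + GAMMA * z + random.gauss(0, 4) for z in zs]
    return zs, cs

def slope(zs, cs):
    """Least squares slope estimate gamma-hat = m_zc / m_zz."""
    n = len(zs)
    mz, mc = sum(zs) / n, sum(cs) / n
    return (sum((z - mz) * (c - mc) for z, c in zip(zs, cs))
            / sum((z - mz) ** 2 for z in zs))

# One path of growing samples: the estimates wobble from step to step.
for s in (20, 21, 40, 400):
    print(s, round(slope(*draw_sample(s)), 4))

def spread(size, reps=200):
    """Standard deviation of gamma-hat over many samples of one size."""
    ests = [slope(*draw_sample(size)) for _ in range(reps)]
    m = sum(ests) / reps
    return (sum((e - m) ** 2 for e in ests) / reps) ** 0.5

s20, s400 = spread(20), spread(400)
assert s400 < s20   # the distribution of gamma-hat becomes "more pinched"
```

The assertion at the end is the operational content of consistency: not improvement at every step, but a tighter sampling distribution for larger S.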

2.12. The merits of unbiasedness and consistency 

Are the properties of unbiasedness and consistency worth the fuss? 
Remember the fundamental fact that with limited sampling resources 
it is not possible to estimate γ correctly every time, even when the 
estimating procedure is unbiased and consistent. 

Because of a small budget, our sample may be so small that γ̂ has a 
large variance. Even if the sample is large, it may be an unlucky one, 
yielding an extremely wrong estimate. The mistake has happened, 
and it is no consolation to know that, if we had taken all possible 
samples of that size, we would have hit the correct estimate on the 
average. The following complaint is a familiar one from the area of 
Uncertainty Economics: Some people advise me to behave always so as 
to maximize my expected utility; in other words, to make once-in-a- 
lifetime decisions as if I had an eternity to repeat the experiment. 
Well, if I get my head chopped off on the first (and necessarily final) 
try, what do I care about the theoretical average consequences of my 
decision? Wherever a comparatively crucial outcome hinges on a 
single correct estimate, unbiasedness is not in itself a desirable property. 

Likewise, it is mockery to tell an unsuccessful econometrician that 
he could have improved his estimate if he had been willing to enlarge 
his sample indefinitely. 

What, then, is the use of unbiasedness and consistency? In them- 
selves they are of no use; they do help, however, in the design of 
samples and as rules for research strategy and communication among 
scientists. 
There is a body of statistical theory — not discussed in this work — 
which tells us how to redesign our sample in order to decrease bias and 
inconsistency to some tolerable level. For example, with an infinite 
universe, if we have two parameters to estimate, the theory says that 
a sample must be larger than 100 if consistency is to become "effective 
at the 5 per cent level." Whether we want to take a sample that large 


depends on the use and strategic importance of our estimate as well as 
on the cost of sampling. All this opens up the fields of verification and 
statistical decision, into which we shall not go here. 

Unbiasedness, consistency, and other estimating criteria to be 
introduced below are sometimes conceived of as scientific conventions: 1 

If content to look at the procedure of point estimation unpretentiously as a 
social undertaking, we may therefore state our criterion of preference for a 
method of agreement so conceived in the following terms: 

(i) different observers make at different times observations of one and the 

same thing by one and the same method; 
(ii) individual sets of observations so conceived are independent samples 
of possible observations consistent with a framework of competence, and as 
such we may tentatively conceptualise the performance of successive 
sets as a stochastic process; 

(iii) we shall then prefer any method of combining constituents of observa- 
tions, if it is such as to ensure a higher probability of agreement 
between successive sets, as the size of the sample enlarges in accordance 
with the assumption that we should thereby reach the true value of the 
unknown quantity in the limit; 

(iv) for a given sample size, we shall also prefer a method of combination 
which guarantees minimum dispersion of values obtainable by different 
observers within the framework of (i) above. 

In the long run, the convention last stated guarantees that there will be a 
minimum of disagreement between the observations of different observers, if 
they all pursue the same rule consistently. . . . We have undertaken to 
operate within a fixed framework of repetition. This is an assumption which 
is intelligible in the domain of surveying, of astronomy or of experimental 
physics. How far it is meaningful in the domain of biology and whether it is 
ever meaningful in the domain of the social sciences are questions which we 
cannot lightly dismiss by the emotive appeal of the success or usefulness of 
statistical methods in the observatory, in the physical laboratory and in the 
Cartographer's office. 

Philosophers of probability are still debating whether the italics of 
the quotation do in fact define a universe of sampling, whether it can 
be defined apart from the postulate that an Urn of Nature underlies 
everything, and whether the above scientific conventions become 
reasonable only upon our conceding the postulate. 

1 Lancelot Hogben, Statistical Theory, pp. 1106-207 (London: George Allen & 
Unwin, Ltd., 1957). Italics added. 


2.13. Other estimating criteria 

So far I have mentioned three estimating criteria, or properties that 
we might desire our estimating procedures to have. These were 
(1) maximum likelihood, (2) unbiasedness, (3) consistency. Some 
others are: 

4. Efficiency 

If γ̃ and γ̂ are two estimators from a sample of S observations, the 
more efficient one has the smaller variance. It is possible to have 
σ(γ̃,γ) < σ(γ̂,γ) for some sample sizes and the reverse for other sample 
sizes; or one may be uniformly more efficient than the other; some 
estimators are most efficient, others uniformly most efficient. 

5. Sufficiency 

An estimator from a sample of size S is sufficient if no other estimator 
from the same sample can add any knowledge about the parameter 
being estimated. For instance, to estimate the population mean, the 
sample mean is sufficient and the sample median is not. 

6. The following desirable property has no name. Let σ(γ̂,γ|S) 
shrink more rapidly than σ(γ̃,γ|S) as the sample increases. Then γ̂ 
is more desirable than γ̃. 
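Criteria 4 and 5 can be illustrated together with the textbook example of the population mean. The sketch below is mine, not the book's, with invented numbers: for a normal universe the sample mean estimates the population mean more efficiently than the sample median, in the sense that its sampling variance is smaller.

```python
# A sketch (mine, invented numbers) of criterion 4: sampling spread of the
# mean versus the median as estimators of a normal population's mean.
import random
import statistics

random.seed(2)

def sampling_sd(estimator, size=25, reps=2000):
    """Standard deviation of an estimator over repeated samples."""
    ests = [estimator([random.gauss(10, 3) for _ in range(size)])
            for _ in range(reps)]
    return statistics.pstdev(ests)

sd_mean = sampling_sd(statistics.mean)
sd_median = sampling_sd(statistics.median)
print(round(sd_mean, 3), round(sd_median, 3))
assert sd_mean < sd_median    # the mean is the more efficient of the two
```

For large samples from a normal universe the ratio of the two standard deviations approaches √(π/2), roughly 1.25, so the inefficiency of the median is not trivial.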

There is no end to the criteria one might invent. Nor are the criteria 
listed mutually exclusive. Indeed, a maximum likelihood estimator 
tends to the normal distribution as the sample increases; it is consistent 
and most efficient for large samples. A maximum likelihood estimator 
from a single-peaked, symmetrically distributed universe is unbiased. 

2.14. Least squares and the criteria 

If all the Simplifying Assumptions are satisfied, the least squares 
method of estimating α and γ in single-equation models of the form 

C_t = α + γZ_t + u_t     (2-13) 

yields maximum likelihood, unbiased, consistent, efficient, and suf- 
ficient estimates of the parameters. This result can be generalized 


in a variety of directions. The first generalization is that it applies to 
a model of the form 

y(t) = α + γ_1 z_1(t) + γ_2 z_2(t) + · · · + γ_H z_H(t) + u(t)     (2-14) 

where y is the endogenous variable and the z's are exogenous variables. 
(Least squares is biased if some of the z's are lagged values of y — this 
question is postponed to the next chapter.) 

Least squares yields maximum likelihood, unbiased, consistent, 
sufficient, but inefficient estimates if the variance of u_t is not constant 
but varies systematically, either with time or with the magnitude of 
the exogenous variables. Such systematic variation of its variance 
makes u heteroskedastic. 

2.15. Treatment of heteroskedasticity 

We shall confine the discussion of heteroskedasticity to model 
(2-13) on the understanding that it generalizes to (2-14). 

The random term can have a variable variance σ_uu(t) for various reasons: 

1. People learn, and so their errors of behavior become absolutely 
smaller with time. In this case σ(t) decreases. 

2. Income grows, and people now barely discern dollars whereas 
previously they discerned dimes. Here σ(t) grows. 

3. As income grows, errors of measurement also grow, because now 
the tax returns, etc., from which C and Z are measured no longer 
report pennies. Here σ(t) increases. 

4. Data-collecting techniques improve. σ(t) decreases. 
Consider Fig. 5. It shows a sample of three points coming from a 
heteroskedastic universe. 

Since the errors are heteroskedastic, we would, on the average, 
expect observations in range 1 to fall rather near the true regression 
line, observations in range 2 somewhat farther, and in range 3 farther 
still. In any given sample, say, (a,b,c), points b and c should ideally 
be "discounted" according to the greater variances that prevail in 
their ranges. Using the straight sum of squares is the same as failing 
to discount b and c. The result is that sample (a,b,c) gives a larger 
value for γ than it would if observations had been properly discounted. 

If no allowance is made for the changing variance σ(t), least squares 



fits are maximum likelihood, unbiased, and consistent but inefficient. 
To show inefficiency, consider the likelihood function of (2-4). There, 
the matrix of the covariances of the random term not only was 
diagonal but had equal entries; so its common scalar factor could be 
taken outside and dropped when the likelihood function was maximized 
with respect to γ (and α). It is this fact that made γ̂ an efficient 
estimate. With unequal entries along the diagonal, this is no longer 
possible. To obtain an efficient, unbiased, and consistent estimate of 
γ, we must solve a complicated set of equations involving γ, σ(1), . . . , σ(S). 
Somewhat less efficient (but more so than minimizing Σu²) is to make 


Fig. 5. A typical sample from a heteroskedastic universe. 

(from outside knowledge) approximate guesses about σ(1), . . . , σ(S) 
and to minimize the sum of squares of appropriately "deflated" 
residuals (see Exercise 2.G). This, too, is an unbiased and consistent 
estimate. 


2.F Prove that γ̂ = m_zc/m_zz is unbiased and consistent even 
when u is heteroskedastic. 

2.G Let φ(s) be an estimate (from outside information) of 1/σ_uu(s). 
Prove that minimizing Σφ(s)u²(s) yields the following estimate of γ: 

γ̂(φ) = [(Σφ)(ΣφZC) − (ΣφZ)(ΣφC)] / [(Σφ)(ΣφZ²) − (ΣφZ)²] 

2.H Prove the unbiasedness and consistency of γ̂(φ). 
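The weighted estimate of Exercise 2.G can be checked numerically. The sketch below is my own, with invented data; it codes the weighted least squares slope directly from the formula obtained by minimizing Σφ(s)u²(s).

```python
# Exercise 2.G's weighted slope, coded from the closed-form solution.
# The data Z, C below are invented for illustration.

def gamma_hat_phi(phi, Z, C):
    """Weighted least squares estimate of gamma in C = alpha + gamma*Z + u."""
    s_p   = sum(phi)
    s_pz  = sum(p * z for p, z in zip(phi, Z))
    s_pc  = sum(p * c for p, c in zip(phi, C))
    s_pzc = sum(p * z * c for p, z, c in zip(phi, Z, C))
    s_pz2 = sum(p * z * z for p, z in zip(phi, Z))
    return (s_p * s_pzc - s_pz * s_pc) / (s_p * s_pz2 - s_pz ** 2)

Z = [1.0, 2.0, 3.0, 4.0]
C = [5.5, 6.0, 6.6, 7.0]            # roughly C = 5 + 0.5*Z

# Equal weights collapse the formula to ordinary least squares.
print(gamma_hat_phi([1, 1, 1, 1], Z, C))
# Down-weighting the presumed high-variance observations changes the answer.
print(gamma_hat_phi([1.0, 0.5, 0.25, 0.125], Z, C))
```

Note that multiplying every weight by the same constant leaves the estimate unchanged; only the relative weights matter, which is why rough outside guesses of 1/σ_uu(s) can serve.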


Digression on arbitrary weights 

The weights φ(s) are arbitrary. Is there no danger that the 
denominator of γ̂(φ) might be (nearly or exactly) zero and blow 
up the proceedings? 

Answer: There is none. 

(Σφ)(ΣφZ²) − (ΣφZ)² = Σ_{i<j} φ_iφ_j(Z_i − Z_j)² > 0 

It is perfectly proper to deflate the heteroskedastic residuals by the 
exogenous variable Z itself and to fit by least squares the homoskedastic 
equation 

C/Z = α/Z + γ + u/Z     (2-15) 

instead of the original heteroskedastic one 

C = α + γZ + u     (2-16) 

From (2-15) and (2-16) we obtain numerically different consistent 
and unbiased estimates of α and γ. 


2.I Prove that α̂(Z) = m_(C/Z)(1/Z)/m_(1/Z)(1/Z) is unbiased and consistent. 

Further readings 

Maurice G. Kendall, "On the Method of Maximum Likelihood" (Journal 
of the Royal Statistical Society, vol. 103, pt. 3, pp. 389-399, 1940) discusses 
the reasonableness of the method and the concept of likelihood. Whether 
the principle of maximum likelihood is logically wound up with subjective 


belief or inverse probability is still under debate. The intrepid reader who 
leafs through the last 30 or so years of the above Journal will be rewarded 
with the spectacle of a battle of nimble giants: Bartlett, Fisher, Gini, Jeffreys, 
Kendall, Keynes, Pearson, Yule. 

The algebra of moments is a special application of matrices and vectors. 
Matrices and determinants are explained in the appendixes of Klein and 
Tintner. Allen devotes two chapters (12 and 13) to all the vector, matrix, 
and determinant theory an economist is ever likely to need. 

The estimating criteria of unbiasedness, consistency, etc., are clearly stated 
and briefly discussed in the first dozen pages of Kendall's second volume, 
and debunked by Hogben in the reference cited in the text. 

The reason for using m_uu/Sm_zz as an estimate of σ(γ̂,γ̂), the formulas for 
estimating cov (α̂,γ̂) and σ(α̂,α̂), and the extensions of these formulas for 
several y variables are stated and rationalized (in my opinion, not too con- 
vincingly) by Klein, pp. 133-137. 


Bias in models of decay 

3.1. Introduction and summary 

This chapter is tedious and not crucial; it can be skipped without 
great loss. I wrote it for two reasons: to develop the concept of 
conjugate samples, and to show what I have claimed in the Preface: 
that common-sense interpretations of intricate theorems in mathe- 
matical statistics can be found. 

The main proposition of this chapter is that a single-equation model 
of the form 

y_t = y(t) = α + γ_1 z_1(t) + · · · + γ_H z_H(t) + u(t)     (3-1) 

in which some of the z's are not exogenous variables but rather lagged 
values of y itself, necessarily violates Simplifying Assumption 6, and 
hence that maximum likelihood estimates of α, γ_1, . . . , γ_H are 
biased. 

The concept of conjugate samples gives a handy and simple-minded 
but entirely rigorous way to test for bias. It will be used again and 
again in later chapters for models much more complicated than (3-1). 

Equations involving lags of an endogenous variable are called autoregressive. 



Most satisfactory dynamic econometric models are multivariate 
autoregressive systems, in other words, elaborate versions of (3-1), 
and share its pitfalls in estimation. We shall see that the character of 
initial conditions affects vitally our estimating procedure and that, 
unfortunately, in econometrics the initial conditions are not favorable 
to estimation, though in the experimental sciences they commonly are. 

If the initial condition y(0) is a fixed constant Y, the maximum likeli- 
hood criterion leads to least squares regression of y(t) on y(t — 1), 
and the resulting estimate for γ is biased, except for samples of size 1. 

If y(0) is a random variable, independent of u, then the maximum 
likelihood criterion does not lead to least squares. If least squares are 
used in this instance, they lead to biased estimates, again with the 
exception of samples of size 1. 


The size S of the sample is given in units that correspond to the 
number of points through which a line is fitted. Thus, if we observe 
only y(3) and y(2), this is a sample of one; S = 1. If we observe 
y(4), y(3), and y(2), this makes a sample of two points, S = 2, and so on. In 
his proof of this theorem, Hurwicz (in Koopmans, chap. 15) would call 
these, respectively, samples of size T = 2 and T = 3. The difference 
is important when observations have gaps (are not consecutive). We 
shall confine ourselves to consecutive samples. Appendix D deals 
with the general case. 

3.2. Violation of Simplifying Assumption 6 

A lagged variable, unlike an exogenous variable, cannot be inde- 
pendent of the random component of the model. In (3-1) a lagged 
value of y is necessarily correlated with some past value of u, because 
y(t) and u(t) are clearly correlated. Therefore, the very specification 
of (3-1) rules out Simplifying Assumption 6. 

But why worry about such models? Because (3-1) and its generali- 
zations express in linear form oscillations, decay, and explosions, which 
are all of great interest and which are, indeed, the bread and butter of 
physics, astronomy, and economics. For instance, springs behave 
substantially like 

y(t) = α + γ_1 y(t − 1) + γ_2 y(t − 2) + u(t) 


and radioactive decay and pendulums like 

y(t) = γy(t − 1) + u(t)     (3-2) 

Business cycles are more complicated, involving several equations 
like (3-1). 

Why do we want unbiased estimates? There are excellent reasons. 
If the world responds to our actions with some delay or if we respond 
with delay to the world, in order to act correctly we need to know the 
parameters accurately. How hot the water in the shower is now 
depends on how far I had turned the tap some seconds ago. If my 
estimate of the parameter expressing the response of water temperature 
to a turn of the tap is biased, this means that I freeze or get scalded 
or that I alternate between these two states, and, in any event, that 
I reach a comfortable temperature much later than I would with an 
unbiased estimate. 

In economics, consumers, businesses, and governments act like a 
man in a shower. The information they get about prices, sales, 
orders, or national income comes with some delay and reflects the 
water temperature at the tap some time ago. Moreover, it takes time 
to decide and to put decisions into effect. If the decision makers 
have misjudged how strong are the natural damping properties of the 
economy, decisions and policy will either overshoot or undershoot the 
mark, or alternate between overshooting and undershooting it, and 
will cause uncomfortable and unnecessary oscillations in economic activity. 

Our discussion will now be confined to the simplest possible case 
(3-2). Let consumption this year y(t) depend on consumption last 
year y(t — 1), as in (3-2). If the relationship involved a constant 
term a, we eliminate a by measuring y not from the origin but from its 
equilibrium value. I shall illustrate my argument by a concrete 
example where the true y has the convenient value 0.5 and where the 
initial value Y is fixed and equal to 24. 

In Fig. 6, line OP represents the exact relationship y_t = 0.5y_{t−1}. 

3,3. Conjugate samples 

In model (3-1) with fixed initial conditions, we can describe a sample 
completely by mentioning two things: (1) what time periods it includes 



and (2) what values the disturbances took on in those periods. For 
example, (a,b,c,d) in Fig. 6 is completely described by 

s  = 1, 2, 3, 4 
u_s = 4, 0, 0, 0 

(a′,b′,c′,d′) is described by 

s  = 1, 2, 3, 4 
u_s = −4, 0, 0, 0 

and (a′,b′,d′,e) by 

s  = 1, 2, 4, 5 
u_s = −4, 0, 0, 0 

If u is symmetrically distributed, all conceivable samples of size S 
that one can draw from the universe can be arranged in conjugate sets. 
We shall see that in each conjugate set the maximum likelihood estimates 












Fig. 6. Conjugate disturbances. 

of γ average to less than the true value of γ and, therefore, that 
maximum likelihood estimates are biased for all samples of size S. 

These propositions need to be qualified if S = 1 or if γ is not between 
0 and 1; they are proved if u(t) is normally distributed, but only 
conjectured if u(t) has some other symmetrical distribution. 

For an introduction to the concept of conjugate samples, consider 



Fig. 6, which depicts two of the many possible courses that events can 
follow under our assumptions that γ = 0.5 and Y = 24. One course 
is represented by the points a, b, c, d, e, . . . ; the other by a′, b′, c′, 
d′, . . . . In the first course, the disturbance is equal to +4 in period 1 
and zero thereafter. In the second course, it is −4 in period 1 and zero 
thereafter. The samples S(+) = (a,b,c,d) and S(−) = (a′,b′,c′,d′) 
are conjugate samples, and form a conjugate set. Similarly, (a,b,c) and 
(a′,b′,c′) form a conjugate set. 

To be conjugate, two samples must be drawn from the same time 
span s = 1, 2, . . . , S; and the disturbances u_s that contributed to 
corresponding observations must have the same absolute value in the 
two samples. This definition is for consecutive samples only. Appen- 
dix D extends it to the nonconsecutive case. 

Thus, sample 

s  = 3, 4, 5, 6 
u_s = 0, 0, 0, 0 

forms a conjugate set all by itself. Sample 

s  = 3, 4, 5, 6 
u_s = 0, 0, 17, 0 

has as its conjugate 

s  = 3, 4, 5, 6 
u_s = 0, 0, −17, 0 

Sample 

s  = 4, 5, 6, 7 
u_s = 0, 1, 0, −9 

has three conjugates, the following: 

s  = 4, 5, 6, 7      s  = 4, 5, 6, 7      s  = 4, 5, 6, 7 
u_s = 0, −1, 0, −9   u_s = 0, 1, 0, 9     u_s = 0, −1, 0, 9 


The greatest conjugate set of samples of size S contains 2^k samples, 
where k (0 ≤ k ≤ S) represents the number of nonzero disturbances. If S = 4, 
the largest conjugate set contains 16 samples. 
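The bookkeeping of conjugate sets is mechanical enough to automate. The small generator below is mine, not the book's: it flips the signs of the nonzero disturbances in every possible way, producing the 2^k members of the set (counting the original sample itself).

```python
# A generator (mine, not the book's) for the conjugate set of a consecutive
# sample: every sign pattern of the nonzero disturbances, 2**k in all.
from itertools import product

def conjugate_set(u):
    """All samples conjugate to disturbance vector u (including u itself)."""
    choices = [(x, -x) if x != 0 else (0,) for x in u]
    return [list(v) for v in product(*choices)]

print(conjugate_set([0, 0, 17, 0]))       # the pair of conjugates from the text
print(len(conjugate_set([0, 1, 0, -9])))  # the sample plus its three conjugates
assert len(conjugate_set([4, 1, 2, 9])) == 16   # k = S = 4 gives 2**4 = 16
```

An all-zero disturbance vector yields a set of one, matching the case k = 0.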

3.4. Source of bias 

In Fig. 6, line OR with slope γ̂[S(+)] = 0.6053 is the least squares 
regression through the origin fitted to sample S(+) = (a,b,c,d); and OR′ 
with slope γ̂[S(−)] = 0.3545 is the same for the conjugate (a′,b′,c′,d′). 
The line OR overestimates γ because OR is pulled up by point a. The 
line OR′ underestimates γ because of the downward pull of point a′. 
As we have ½(0.6053 + 0.3545) = 0.4799 < γ, the downward pull 
is the stronger. But why? Because point a is accompanied by b, c, d, 
and a′ by b′, c′, d′. The primed points b′, c′, d′ are closer to the origin 
than the corresponding unprimed points; hence, their "leverage" on 
their least squares line OR′ is weaker than the leverage of the unprimed 
points on theirs (line OR). It is impossible for a′ to be accompanied 
by b, c, d, because all future periods must necessarily inherit whatever 
impulse was first imparted by the random term of period 1. Points b′, 
c′, d′ inherit a negative impulse, and points b, c, d inherit a positive 
impulse. 

Another way of stating this is by referring to (3-1). In (3-1) one 
of the z's (say, z_4) is a lagged value of y (say, the lag is 2 time periods). 
It follows that z_4(t) is correlated with the past value of the disturbance 
u(t − 2), since y(t) is clearly correlated with u(t). 

All the proofs of bias later in this chapter and in Appendix D are 
merely fancy versions of what I have just shown for this special case. 
When conjugate sets are large, arguments from the geometry of Fig. 6, 
though perfectly possible, become confusing, and so we turn to algebra. 

With fixed initial condition y(0) = Y, the maximum likelihood esti- 
mate of γ is the least squares estimate 

γ̂ = (Σ y_t y_{t−1}) / (Σ y_{t−1}²)     (3-3) 





3.5. Extent of the bias 

From (3-3) and (3-2), 

γ̂ = γ + (Σ u_t y_{t−1}) / (Σ y_{t−1}²)     (3-4) 

We write the above fraction N/D. We shall see that the bias N/D 
varies with the true value of γ, the size of the sample, and the size 
of the initial value Y. For instance, in small samples it is almost 
25 per cent; in samples of 20 observations, it is about 10 per cent of 
the true value of γ. It never disappears, no matter what value true γ 
may have or how large a sample one takes. 

By applying (3-2) repeatedly and letting P, Q, and R stand for 
polynomials, we get 

N = (u_1 + γu_2 + γ²u_3 + · · · + γ^{S−1}u_S)Y + P(u_1, . . . , u_S) 

D = (1 + γ² + γ⁴ + · · · + γ^{2S−2})Y² + YQ(γ, u_1, . . . , u_S) + R(u_1, . . . , u_S) 

By considering N/D, one can establish that the bias is aggravated the 
further γ is from +1 or −1 and the smaller the sample. Bias exists 
even when γ = ±1 or when γ = 0; the latter is truly remarkable, 
since the model is then reduced to y(t) = u(t). Since N is a linear 
function of Y and the always positive denominator D is a quadratic 
function of Y, the bias N/D can be quite large for certain ranges of Y. 
The above results generalize to model (3-1), although it is not easy 
to say whether the bias is up or down. 
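The downward bias claimed for model (3-2) can be verified by brute force. The Monte Carlo sketch below is my own, with an invented disturbance scale; it applies the least squares formula (3-3) to many samples of size 20 started from the fixed value Y = 24 and averages the estimates.

```python
# Monte Carlo check (my own sketch, invented disturbance scale) of the bias
# of least squares (3-3) in y(t) = gamma*y(t-1) + u(t) with fixed y(0) = Y.
import random

random.seed(3)
GAMMA, Y0, S = 0.5, 24.0, 20       # true slope, fixed start, sample size

def one_estimate():
    y_prev, num, den = Y0, 0.0, 0.0
    for _ in range(S):
        y = GAMMA * y_prev + random.gauss(0, 4)
        num += y * y_prev          # accumulates sum of y_t * y_{t-1}
        den += y_prev * y_prev     # accumulates sum of y_{t-1}**2
        y_prev = y
    return num / den

REPS = 5000
avg = sum(one_estimate() for _ in range(REPS)) / REPS
print(round(avg, 3))               # falls short of the true gamma = 0.5
assert avg < GAMMA
```

The average estimate falls below the true γ, in line with the text's claim of a bias of roughly 10 per cent at S = 20; no single sample need show it, but the average over many samples does.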

3.6. The nature of initial conditions 

The following fantastically artificial example illustrates the concept 
of conjugate samples and what it means for initial conditions 2/(0) to 
be random or fixed. 

An outfit that runs automatic cafeterias has its customers use, 


instead of coins, special tokens made of copper. The company has 
several cafeterias across the country, but its customers rarely think 
of taking their leftover tokens with them when they travel or move 
from city to city. As there is at most one cafeteria per city, each 
cafeteria's tokens are like independent, closed monetary systems. 
Let us look at a single cafeteria of this kind. 

Originally it had coined a number of brand-new tokens and put 
them in circulation, using y(0) pounds of copper. Thereafter, the 
amount of copper in the tokens is subject to two influences. (1) To 
begin with, the tokens wear out as they are used. The velocity of 
token circulation is equal in all cities, and customers' pockets, hands, 
keys, and other objects that rub against the tokens are equally abra- 
sive in all cities. Thus, in each city, year t inherits only a part γ 
(0 < γ < 1) of the copper circulating in the previous year. (2) In 
addition to the systematic factor of wear and tear, random influences 
are at play. First, some customer's child now and then swallows a 
token; this disappears utterly from circulation into the city's sewers. 
However, occasionally there is an opposite tendency. An amateur 
but successful counterfeiter mints his own token now and then, or a 
lost token is found inside a fish and put back into circulation. So 
the copper remaining in circulation is described by the stochastic 
model (3-2). The problem for the company is how to estimate the 
true survival rate of its tokens. 

It is very important to interpret correctly our first assumption that 
"u(t) is a random variable in each time period t." It means that u(t) 
is capable of assuming at least two values (opposites, if u is 
symmetrical) in the same period of time. But how can it? Here we 
need a concept of conjugate cities analogous to conjugate samples. 
Imagine that the only positive disturbances come from one counter- 
feiter and that the only negative disturbances come from one child, 
the counterfeiter's child, who swallows tokens. The counterfeiter is 
divorced, the child was awarded to the mother, and the two parents 
always live in separate cities, say, Ames and Buffalo; but who lives 
where in year t is decided at random. Ames and Buffalo are conjugate 
cities, because, when one experiences counterfeiting, +u(t), the other 
necessarily experiences swallowing, — u(t). If there were more families 
like this one, the set of conjugate cities would have to expand enough 


to accommodate all permutations of the various values that ±u(t) 
is capable of assuming. 

We have fixed initial conditions if each cafeteria starts with the 
same poundage, and random initial conditions when the initial pound- 
age is a random variable. To estimate the token survival rate, 
different procedures should be used in the two cases. 

3.7. Unbiased estimation 

Unbiased estimation of 7 is possible only if the initial copper endow- 
ment is a fixed constant Y. The only unbiased estimate is given by 
the ratio of the first two successive ?/'s using data from a single city: 

γ̂ = y(1)/y(0) = y(1)/Y     (3-5) 

which is a degenerate least squares estimate. 

This result is really startling. It says that we must throw out any 
information we may have about copper supply anywhere, except in 
year 0 and year 1 in, say, Ames. Unless we do this we can never 
hope to get an unbiased estimate. Estimating 7 without bias when 
each city starts with a different amount of copper is an impossible 
task. A complete census of copper in all cities (i) in two successive 
years would give the correct (not just unbiased) estimate 

γ̃ = Σ_i y_i(t) / Σ_i y_i(t − 1) = γ 

We can draw another fascinating conclusion: If we have the bad 
luck to start off with different endowments, we can never get an 
unbiased estimate of 7. But suppose we find that the endowments 
of all cities happen to be equal later, say, in period t — 1. Then all 
we have to do is wait for the next year, measure the copper of any 
one city, say, Buffalo, and compute the ratio 

γ̂ = y(t)/y(t − 1)     (3-6) 



which is an unbiased estimate. (Where the information would come 
from, that all cities have an equal token supply in year t − 1, is 
another matter.) 

The experimental scientist is, however, free from such predicaments. 
If he thinks radium decays as in (3-2), then he can make initial con- 
ditions equal by putting aside in several boxes equal lumps of radium. 
Then he can let them decay for a year, remeasure them, apply (3-6) 
to the contents of any one box, and average the results. Any one box 
gives an unbiased estimate. Averaging the contents of several boxes 
gives an estimate that is efficient as well as unbiased. 
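The radium thought experiment can be put in numbers. The sketch below is mine, with all values invented: many boxes start from the same fixed endowment Y, each decays one period according to (3-2), and each ratio y(1)/Y is an unbiased estimate of γ; averaging over boxes buys efficiency, not unbiasedness.

```python
# The radium experiment in numbers (my sketch, invented values): equal
# fixed endowments Y, one period of decay by (3-2), ratio estimates y(1)/Y.
import random

random.seed(4)
GAMMA, Y0, BOXES = 0.5, 100.0, 4000

# Each box decays one period: y(1) = gamma*Y + u, so y(1)/Y estimates gamma.
ratios = [(GAMMA * Y0 + random.gauss(0, 2)) / Y0 for _ in range(BOXES)]

one_box = ratios[0]                # unbiased but noisy
pooled = sum(ratios) / BOXES       # unbiased and far less variable
print(round(one_box, 3), round(pooled, 3))
```

A single box may miss γ badly; the pooled average sits very close to it, which is the efficiency gained by repetition under controlled initial conditions.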

The econometrician cannot control his initial conditions in this way. 
If he wants an unbiased estimate, he must throw away, as prescribed, 
most of his information, use formula (3-6), and thus get an unbiased 
and inefficient estimate. Or else he may decide that he wants to 
reduce the variance of the estimate at the cost of introducing some 
bias; then he will use a formula like that of Exercise 3.C below or 
some more complicated version of it. 

Autoregressive equations are related to the moving average, a tech- 
nique commonly employed to interpolate data, to estimate trends, 
and to isolate cyclical components of time series. The statistical 
pitfalls of estimating (3-2) plague time series analysis, and they are 
not the only pitfalls. The last chapter of this book returns to some 
of these problems. 


3.A Prove that (3-5) is unbiased. 
3.B Prove that γ̂ = y(2)/y(1) is biased. 
3.C Prove that γ̂ = [y(2) + y(1)]/[y(1) + Y] is biased. 
3.D Let u_t in (3-2) have the symmetrical distribution q(u_t) with 
finite variance. "Symmetrical" means q(u_t) = q(−u_t). Then the 

likelihood function of a random consecutive sample is ∏ q(u_t). 

Prove that the maximum likelihood estimate of γ is obtained by 
maximizing the expression Σ log q(û_t), where the û_t are the vertical 
deviations from the line that we are seeking. 
3.E By the method of conjugate samples or by any other method, 
prove or disprove the conjecture that the estimate of Exercise 3.D is 
biased. 

Further readings 

The reader who wants to see for himself how intricate is the statistical 
theory of even the simplest possible lagged model (3-2) may look up "Least- 
Squares Bias in Time Series," by Leonid Hurwicz, chap. 15 of Koopmans, 
pp. 365-383. Tintner, pp. 255-260, gives examples and shows additional 


Pitfalls of simultaneous 

4.1. Simultaneous interdependence 

"Everything depends on everything else" is the theme song of the 
Economic and the Celestial Spheres. It means that several con- 
temporaneous endogenous variables hang from one another by means 
of several distinct causal strings. Thus, there are two causal (more 
politely, functional) relations between aggregate consumption and 
aggregate income : Since people are one another's customers, consump- 
tion causes income, and, since people work to eat, income causes 
consumption. The two relationships are, respectively, the national 
income identity in its simplest form 

y_t = c_t     (4-1) 

and the (unlagged) stochastic consumption function in its simplest form 

c_t = α + βy_t + u_t     (4-2) 

We can imagine that causal forces flow from the right to the left of 
the two equality signs. 



The moral of this chapter is that, if endogenous variables, like c 
and y, are connected in several ways, like (4-1) and (4-2), every 
statistical procedure that ignores even one of the ways is bound to be 
wrong. The statistical procedure must reflect the economic interdependence. 

4.2. Exogenous variables 

I shall not vouch for the Heavens, but in economics there are such 
things as exogenous variables. A variable exogenous to the economic 
sphere is a variable, like an earthquake, that influences some economic 
variables, like rents and food prices, without being influenced back. 
The random term u is, ideally, exogenous — though in practice it is a 
catchall for all unknown or unspecified influences, exogenous or endoge- 
nous. One thing is certain: Earthquakes and such are not influenced 
by disturbances in consumption. Indeed, the definition of an exoge- 
nous variable is that it has no connection with the random component 
of an economic relationship. 

My prototype exogenous variable, investment z, is not really 
exogenous to the economic system, especially in the long run, but we 
shall bow to tradition and convenience for the sake of exposition. 

4.3. Haavelmo's proposition 

The models in this chapter, like the single-equation models treated 
so far, (1) are linear and (2) have all the Simplifying Properties. 
Therefore, they are subject to all the pitfalls I have pointed out so 
far. Unlike the models of Chaps. 1 to 3, the new models each contain 
at least two equations. Most of my examples will have precisely 
two (and not three or four) for convenience only, since the results 
can easily be extended. 

New kinds of complication arise when a second equation is added. 

1. The identification problem 

It is sometimes impossible to estimate the parameters — this problem 
is side-stepped until Chap. 6. 


2. The Haavelmo 1 problem 

The intuitively obvious way of estimating the parameters of a 
two-equation model is wrong, even in the simplest of cases, where one 
of the equations is an identity. We shall see that pedestrian methods 
are unable to estimate correctly the marginal propensity to consume 
out of current income, no matter how many years of income and 
consumption data we may have. Even infinite samples overestimate 
the marginal propensity to consume. This difficulty is as strategic 
as it sounds incredible. It means that the multiplier gets overesti- 
mated and, hence, that counterdepression policies will undershoot full 
employment and counterinflation policies will be too timid. Because 
of bad statistical procedures, the cure of unemployment or inflation 
comes too slowly. 

The model is as follows: 

c_t = α + βy_t + u_t   (consumption function)   (4-2)
c_t + z_t = y_t   (income identity)   (4-3)

where z_t (investment) is exogenous, and u_t has all the Simplifying
Properties.² We shall illustrate by assuming the convenient values
α = 5, β = 0.5.

In Fig. 7, line FG represents the true relation c_t = 5 + 0.5y_t. When
the random disturbance is positive, the line moves up; with a negative
disturbance, it moves down. Lines HJ and KL correspond, respec-
tively, to random errors equal to +2 and −2. OQ, the 45° line
through the origin, represents equation (4-3) for the special case in
which investment z is zero. In the years when investment is zero,
the only combinations of income and consumption we could possibly
observe will have to lie on OQ, because nowhere else can there be
equilibrium. If, for instance, in years 1900 and 1917 investment had

1 For reference to Haavelmo, see Further Readings at the end of this chapter. 

² To be specific, Assumption 6 in this case requires that u and z shall not influence
each other, either in the same time period or with a lag. But the random term u
cannot be independent of y. The reason is that α and β are constants, z is fixed
outside the economic sphere, and u comes, so to speak, from a table of random
numbers; if this is so, then, by equations (4-2) and (4-3), α, β, z, and u necessarily
determine y (and c). Thus variable y is not predetermined but codetermined
with c. These statements summarize and anticipate the remainder of the chapter.



been zero and if the errors had been +2 and —2, respectively, then 
points P and P' would have been observed. 

Let us now suppose that in some years investment z_t equals 3.
Line MN (also 45° steep) describes the situation, which is that
c_t + 3 = y_t. With errors u_t = ±2, the observable points are at R
and R'. With errors ranging from −2 to +2, all observable points
fall between R and R'.

Fig. 7. The Haavelmo bias. 

Let us now pass a least-squares regression line through a scatter
diagram of income and consumption, minimizing squares in the vertical
sense and arguing that, from the point of view of the consumption function,
income causes consumption, not vice versa. Such a procedure is
bound to overestimate the slope β of the consumption function and
to underestimate its intercept α. This is Haavelmo's proposition.

The least-squares line (in dashes) corresponds to observation points
that lie in the area PP'R'R. It is tilted counterclockwise relative to
the true line FG because of the pull of "extreme" points in the corners
next to R and P'. The less widely investment z ranges and the bigger the
stochastic errors u are, the stronger is the counterclockwise pull,
because lines PP' and RR' fall closer together.

This overestimating of β persists even if we allow investment to
range very far. Though it is true that the parallelogram PP'R'R gets
longer and longer toward the northeast (say, it becomes PVV'P'), the
fact remains that V and P', the extreme corners, help to tilt the least-
squares line upward. This suggests that perhaps we ought to mini-
mize squares not in a vertical direction but in a direction running from
southwest to northeast. In this particular case (though not generally)
diagonal least squares are precisely correct and equivalent to the
procedure of simultaneous estimation described in the following section.

4.4. Simultaneous estimation 

We know that two relations, not one, account for the slanted posi-
tion of the universe points in Fig. 7. Had the consumption function
been at work alone, a given income change Δy would result in a change
in consumption Δc = βΔy. Had the income identity been at work
alone, then to the same change in income would correspond a larger
change in consumption Δc = Δy. In fact, both relations are at work.
Therefore, the total manifest response of consumption to income is
neither Δc = βΔy nor Δc = Δy, but something in between. This is
why the line in dashes is steeper than FG (and less steep than OQ).

In order to isolate the β effect from a sample of points like PP'R'R,
both relations must be allowed for. This is done by rewriting the
model:

c_t = α/(1 − β) + [β/(1 − β)]z_t + u_t/(1 − β)   (4-4)

y_t = α/(1 − β) + [1/(1 − β)]z_t + u_t/(1 − β)   (4-5)

The term u_t/(1 − β) has the same properties as u_t except that it
has a different variance. Therefore the error term in the new model
has all the Simplifying Properties. Either of the new equations con-
stitutes a single-equation model with one endogenous variable (c and y,
respectively) and one independent variable (z in both cases). There-
fore, the estimating techniques of Sec. 2.5 can be applied to the sophisti-
cated parameters α' = α/(1 − β), γ_1 = β/(1 − β), γ_2 = 1/(1 − β).
Denote these estimates by the hat (^). For the naive least-squares
estimate of α and β, derived from regressing c on y, use the bird (ˇ).
Let us now express these estimates in terms of moments, and let us do
it for β̌, γ̂_1, and γ̂_2 only, leaving aside α̌ and α̂'.


β̌ = m_cy/m_yy        γ̂_1 = m_cz/m_zz        γ̂_2 = m_yz/m_zz

β̌ is a biased estimate of β, because

β̌ = m_cy/m_yy = (βm_yy + m_uy)/m_yy = β + m_uy/m_yy

and it is known that

ε(m_uy/m_yy) ≠ 0   (4-6)

β̌ is inconsistent, because

β̌ = m_(βz+u)(z+u)/m_(z+u)(z+u) = [β + (1 + β)(m_uz/m_zz) + m_uu/m_zz] / [1 + 2m_uz/m_zz + m_uu/m_zz]

The various moments m_cy, m_uu, etc., vary in value, of course, from
sample to sample. As the sample size approaches the population
size, however, m_uz approaches cov (u,z) = 0, m_uu approaches var u > 0,
and m_zz approaches var z > 0. Therefore,

Plim β̌ = (β + var u/var z) / (1 + var u/var z)
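This limit can be checked numerically. The sketch below, a modern aside with illustrative choices for the variances and the sample size, simulates the model and compares the naive slope with the formula just derived; with var u/var z = 1/4 and β = 0.5 the formula gives 0.6.

```python
# Check Plim beta_naive = (beta + var u/var z)/(1 + var u/var z) by
# simulation.  All numerical choices below are illustrative assumptions.
import random

random.seed(2)
alpha, beta = 5.0, 0.5
var_u, var_z = 1.0, 4.0
S = 50000

z = [random.gauss(10.0, var_z ** 0.5) for _ in range(S)]
u = [random.gauss(0.0, var_u ** 0.5) for _ in range(S)]
y = [(alpha + zt + ut) / (1 - beta) for zt, ut in zip(z, u)]
c = [yt - zt for yt, zt in zip(y, z)]          # income identity (4-3)

def m(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / len(a)

beta_naive = m(c, y) / m(y, y)                 # naive least squares
lam = var_u / var_z
plim = (beta + lam) / (1 + lam)                # the limit formula
print(beta_naive, plim)
```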


4.A In similar fashion prove that Plim α̌ < α.
4.B Interpret (4-6).

4.C Show that 1/(1 − β̌) is a biased estimate of 1/(1 − β).
Hint: manipulate the expression

1/(1 − β̌) = 1/(1 − m_cy/m_yy)

and use the fact that ε(m_yz/m_zz) = 1/(1 − β).

4.D Prove that γ̂_1 is an unbiased and consistent estimate of
β/(1 − β).

4.E Prove that γ̂_2 is an unbiased and consistent estimate of
1/(1 − β).

4.F Prove that γ̂_1 and γ̂_2 yield a single compatible estimate of β,
which we call β̂; β̂ = m_cz/m_yz.

4.G Prove that β̂ is a biased but consistent estimate of β.

4.H From the facts that β̌ = β + m_uy/m_yy and that β̂ = β + m_uz/m_yz,
argue that the bias of β̂ is less serious than the bias of β̌.
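Exercises 4.F to 4.H can also be checked numerically. The sketch below is a modern aside with illustrative parameter and distribution choices: the simultaneous estimate m_cz/m_yz lands close to the true β, while the naive m_cy/m_yy overshoots it.

```python
# Compare beta_hat = m_cz/m_yz (exercise 4.F) with the naive m_cy/m_yy
# in the model (4-2)-(4-3).  Numerical choices are illustrative.
import random

random.seed(3)
alpha, beta, S = 5.0, 0.5, 50000
z = [random.uniform(0.0, 6.0) for _ in range(S)]
u = [random.gauss(0.0, 1.0) for _ in range(S)]
y = [(alpha + zt + ut) / (1 - beta) for zt, ut in zip(z, u)]
c = [yt - zt for yt, zt in zip(y, z)]          # income identity

def m(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / len(a)

beta_hat = m(c, z) / m(y, z)      # simultaneous (consistent) estimate
beta_naive = m(c, y) / m(y, y)    # naive least squares (inconsistent)
print(beta_hat, beta_naive)
```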

Digression on directional least squares 

What do we get if, in Fig. 7, we minimize the sum of the square
deviations not vertically but from the southwest to the northeast?
Let P(y,c) in Fig. 8 stand for any point of the sample; PZ is drawn
parallel to the 45° line.

Fig. 8. Directional least squares.

Let θ be the angle of inclination of the
true consumption function; that is, let tan θ = β be the slope of the
line c = α + βy. Then in triangle PZM, from the law of sines,
we have
we have 


p / sin (90° + θ) = PM / sin φ

where φ = 45° − θ and PM = c − α − βy is the vertical distance from
P to the true line. Since sin (90° + θ) = cos θ and
sin (45° − θ) = (cos θ − sin θ)/√2, it follows that

p = √2 (c − α − βy)/(1 − β)

and therefore

Σp² = [2/(1 − β)²] Σ(c − α − βy)²

Setting c = y − z and measuring all variables from their means,

Σp² = [2/(β − 1)²][m_zz + (β − 1)²m_yy + 2(β − 1)m_yz]

Minimizing Σp² with respect to β, we obtain

1/(1 − β̂) = m_yz/m_zz

that is to say, the same expression that we found for γ̂_2.
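The digression's claim can be checked by brute force. The sketch below, a modern aside with illustrative data-generating choices, minimizes the sum of squared 45-degree deviations over a grid of trial slopes and compares 1/(1 − β) at the minimum with m_yz/m_zz.

```python
# Grid-search the directional (diagonal) least-squares slope and compare
# with gamma2_hat = m_yz/m_zz.  All numerical choices are illustrative.
import random

random.seed(4)
alpha, beta_true, S = 5.0, 0.5, 5000
z = [random.uniform(0.0, 6.0) for _ in range(S)]
u = [random.gauss(0.0, 1.0) for _ in range(S)]
y = [(alpha + zt + ut) / (1 - beta_true) for zt, ut in zip(z, u)]
c = [yt - zt for yt, zt in zip(y, z)]

def m(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / len(a)

mcc, mcy, myy = m(c, c), m(c, y), m(y, y)

def sum_p2(b):
    # Sum of squared 45-degree deviations (up to a constant factor):
    # p^2 = 2(c - a - b*y)^2/(1 - b)^2; about the means the intercept drops.
    return 2.0 * (mcc - 2.0 * b * mcy + b * b * myy) / (1.0 - b) ** 2

beta_diag = min((i / 10000.0 for i in range(1, 10000)), key=sum_p2)
gamma2_hat = m(y, z) / m(z, z)
print(1.0 / (1.0 - beta_diag), gamma2_hat)   # the two should nearly agree
```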

4.5. Generalization of the results 

Section 4.3 showed the pitfalls of ignoring the income identity in 
estimating the consumption function, and Sec. 4.4 showed how to get 
around this difficulty by the technique of simultaneous estimation, 
which takes into account the entire model even though the investigator 
may be interested in only a part. Chapters 5 to 9 deal with the 
intricacies of simultaneous estimation and various approximations to it.

To prepare the way, let us enlarge the model slightly, by making
investment respond to income. The new model is

c_t = α + βy_t + u_t   (4-2)

i_t = z_t + γ + δy_t + v_t   (4-7)

c_t + i_t = y_t   (4-8)

where z_t is autonomous investment; i_t is total investment; and u_t, v_t are
random disturbances independent of each other and of present and
past values of z. The last sentence is a statement of Simplifying
Assumption 7, which will be explained and justified in the next chapter.
Letting s_t = y_t − c_t stand for saving, we obtain from (4-2) the
saving function

s_t = −α + (1 − β)y_t − u_t   (4-9)

Figure 9 shows saving SS and investment II as functions of income,
with zero disturbances (thick lines) and disturbed by ±u, ±v, respec-
tively (thin lines), and with the usual (stable) relative slopes.

Naive least squares applied to Fig. 9 underestimates the slope 1 − β
of SS (as it underestimates the slope of OQ in Fig. 7) and, hence, again
overestimates the marginal propensity to consume.

Fig. 9. The Haavelmo bias again.

4.6. Bias in the secular consumption function 

We have shown that naive curve fitting overestimates the slope of 
the consumption function, even with large samples and whether or not 
investment is a function of income. Statistical fits of the secular 
consumption function give a slope varying from over 0.95 to nearly 
1.0, contradicting the lower figures given by budget studies, introspec- 
tion, and Keynes's hunch. To reconcile these facts, consumption 
theories of imitation, irreversible behavior, and more and more 
explanatory variables have been invoked. A large part of what these 
ingenious theories account for can be explained by Haavelmo's proposition.

Further readings 

Trygve Haavelmo's proposition was, apparently, stated first in "The
Statistical Implications of a System of Simultaneous Equations" (Econo-
metrica, vol. 11, no. 1, pp. 1-12, January, 1943), but a later article of his
applying the proposition to the consumption function has attracted far more
attention. This has appeared in three places: Trygve Haavelmo, "Methods
of Measuring the Marginal Propensity to Consume" (Journal of the American
Statistical Association, vol. 42, no. 237, pp. 105-122, March, 1947); reprinted


as Cowles Commission Paper 22, new series; and again as chap. 4 of Hood, 
pp. 75-91. Haavelmo gives numerical results and confidence intervals for 
the parameter estimates. 

Jean Bronfenbrenner, "Sources and Size of Least-squares Bias in a Two-
equation Model," chap. 9 of Hood, pp. 221-235, extends Haavelmo's propo-
sition to three more special cases. An early article by Lawrence R. Klein,
"A Post-mortem on Transition Predictions of National Product" (Journal of
Political Economy, vol. 54, no. 4, pp. 289-308, August, 1946), puts the
Haavelmo proposition in proper perspective, as indicating only one of the
many sources of malestimation.

Milton Friedman, A Theory of the Consumption Function (New York:
National Bureau of Economic Research, 1957), also compares and discusses 
rival measurements of consumption, but his main concern is to test the 
Permanent Income hypothesis and to refine the consumption functions, not to 
discuss econometric pitfalls. It contains valuable references to the literature 
of the consumption function. 

According to Guy H. Orcutt, "Measurement of Price Elasticities in Inter-
national Trade" (Review of Economics and Statistics, vol. 32, no. 2, pp. 117-132,
May, 1950), Haavelmo's proposition explains why exchange devaluation had
been underrated as a cure to balance-of-payments difficulties. Orcutt con-
fines mathematics to appendixes and gives many further references.

Tjalling C. Koopmans in "Statistical Estimation of Simultaneous Economic
Relations" (Journal of the American Statistical Association, vol. 40, no. 232,
pt. 1, pp. 448-466, December, 1945), discusses the Haavelmo proposition with
the help of a supply-and-demand example and with interesting historical
comments. When the random disturbances are viewed not as errors of 
observation clinging to specific variables but as errors of the econometric 
relationship itself, then they affect all simultaneous endogenous variables 
symmetrically, and Haavelmo's problem rears its head. The Koopmans 
article is a good preview of the next chapter. 


Many-equation linear models

5.1. Outline of the chapter 

The moral of Chap. 4 is this: if a model has two equations, they
cannot be estimated one at a time, each without regard for the other,
because both take part together in generating the phenomena from
which we draw samples. This fact rules out, except in special cases,
the use of the pedestrian technique of naive least squares. Both the
moral and the reasons behind it remain in force as the number of
equations in the model increases.

The present chapter is rather unimportant, and might be skipped or
skimmed at first. All its principles are implicit in Chap. 4.

The main task of Chap. 5 is to systematize the study of many- 
equation linear models. First we present some standard and effort- 
saving notation (Sec. 5.2). Next, we review the Simplifying Assump- 
tions, which were originally introduced for one-equation models in 
Chap. 1, to see precisely how they extend to the general case (Sec. 5.3). 
With two or more equations, a seventh Simplifying Assumption is 
required, that of stochastic independence among the equations (Sec. 5.4). 

The presence of several simultaneous equations in a model compli- 
cates the likelihood function with the term det J, which we have 



ignored until now; in intricate fashion det J involves the parameters of 
all equations in the system. (The last proposition merely restates the 
moral of Chap. 4.) The digression on Jacobians explains what det 
J is doing in the likelihood function. 

If we heed the moral to the letter and take det J into account, we 
get into awfully long computations (see Sec. 5.5) in spite of all our 
original Simplifying Assumptions. 

Whether computations are long or short, it pays to lay them out in 
an orderly way. This is a general precept, of course, but its value 
stands out most dramatically in the present chapter. It pays not only 
to do computations in an orderly manner but also to perform some 
redundant ones just in case you might want to check some alternative. 
Econometricians normally settle down to a specific model only 
after much experimentation. And, further, redundant computations 
become necessary when we want to estimate a given promising model 
by increasingly refined techniques. The wisdom of performing the 
redundant computations will become fully apparent only after we have 
dealt with overidentified systems, instrumental variables, limited
information, and Theil's method (Chaps. 6 to 9).

5.2. Effort-saving notation 

It pays to establish once and for all a uniform notation for complete
linear models of several equations. These are conventions, not
substantive assumptions.
The endogenous variables are denoted by y's. There are G endoge-
nous variables, called y_1, y_2, . . . , y_G and, collectively, y. y is the
vector (y_1, y_2, . . . , y_G). We use g (g = 1, 2, . . . , G) as running
subscript for endogenous variables.

The exogenous variables are denoted by z's. There are H exoge-
nous variables, called z_1, z_2, . . . , z_H, and z is their vector. These
may be lagged values of the y's only by special mention. The running
subscript of an exogenous variable is h = 1, 2, . . . , H.

All definitions have been solved out of the system, so that there are
exactly G equations, all stochastic, with errors u_1, u_2, . . . , u_G.
u = (u_1, u_2, . . . , u_G). We speak of the gth equation.

The coefficients of the y's are called βs, and those of the z's are called



γs. They bear two subscripts: the first refers to the equation, the
second to the variable to which the parameter corresponds.

We get rid of the constant term (if any) by letting the last exogenous
variable z_H be identically equal to 1; its parameter γ_gH then becomes
the constant term. In most applications we shall not bother to write
the constant term at all. Either it is in the last term γ_gH z_H, with z_H = 1,
or it has been eliminated by measuring all variables from their means.

B and Γ represent the matrices of coefficients in their natural order:

B = | β_11 β_12 · · · β_1G |        Γ = | γ_11 γ_12 · · · γ_1H |
    | β_21 β_22 · · · β_2G |            | γ_21 γ_22 · · · γ_2H |
    | . . . . . . . . . .  |            | . . . . . . . . . .  |
    | β_G1 β_G2 · · · β_GG |            | γ_G1 γ_G2 · · · γ_GH |

B is always square and of size G × G; Γ is of size G × H. A stands for
the elements of B and Γ set side by side:

A = [B Γ]

that is to say, for the matrix of all coefficients in the model, whether
they belong to endogenous or exogenous variables. A is of size
G × (G + H).

x stands for the elements of y and z set side by side:

x = (y_1, y_2, . . . , y_G; z_1, z_2, . . . , z_H)

that is to say, x is the vector of all variables, whether endogenous or
exogenous, but in their natural order.

α_1 stands for the first row of A, α_2 for the second row, etc.; similarly,
for β_1, β_2, . . . , β_G, γ_1, γ_2, . . . , γ_G. That is, a lower-case bold Greek
letter with a single subscript g represents (some of) the parameters of a
single equation (the gth) of the system.

We reduce the number of parameters to be estimated by dividing
each equation by one of its coefficients. This does not affect the
model in any other way. We use the gth coefficient of the gth equation
for this, so that β_gg = 1. Henceforth we shall always take matrix B
in its "standardized form"

B = | 1    β_12 · · · β_1G |
    | β_21 1    · · · β_2G |
    | . . . . . . . . . .  |
    | β_G1 β_G2 · · · 1    |

A model can be written in a variety of forms:

1. Explicitly, as below (time subscripts omitted):

y_1 + β_12 y_2 + · · · + β_1G y_G + γ_11 z_1 + γ_12 z_2 + · · · + γ_1H z_H = u_1
β_21 y_1 + y_2 + · · · + β_2G y_G + γ_21 z_1 + γ_22 z_2 + · · · + γ_2H z_H = u_2
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
β_G1 y_1 + β_G2 y_2 + · · · + y_G + γ_G1 z_1 + γ_G2 z_2 + · · · + γ_GH z_H = u_G

2. In extended vector form:

β_1 y + γ_1 z = u_1
β_2 y + γ_2 z = u_2
. . . . . . . . . .
β_G y + γ_G z = u_G

3. In condensed vector form:

α_1 x = u_1
α_2 x = u_2
. . . . . .
α_G x = u_G

4. In extended matrix form:

By + Γz = u

5. In condensed matrix form:

Ax = u


Note that, when the context is clear, bold lower-case letters stand
either for a row or for a column vector.

Finally, σ_gh(t) stands for the covariance of u_g(t) with u_h(t); [σ_gh(t)] is
the matrix of these covariances; and [σ^gh(t)] is its inverse.

5.3. The Six Simplifying Assumptions generalized 

A laconic mathematician can generalize the Six Simplifying Assump-
tions with a stroke of brevity by saying that they continue to apply if
we replace the symbol u by the vector u. Our task is to interpret this
in terms of the individual disturbances.

In Chap. 1, I discussed the Six Simplifying Properties when there
was a single equation in the model and, therefore, a single disturbance.
Now we have one disturbance for each equation, and u is the vector
made up of them, u(t) = (u_1(t), u_2(t), . . . , u_G(t)).

Assumption 1 

"u is a random variable" means that each u_g(t) is a random variable,
that is to say, that all equations remaining after solving out the
definitions are stochastic.


Assumption 2 

"u has expected value 0" means that the mean of the joint distribu-
tion is the vector 0 = (0,0, . . . ,0), or that each u_g has zero expected
value.

Assumption 3

"u has constant variance" means that the covariances

σ_gh = cov (u_g, u_h)

of the several disturbances do not vary with time.

Assumption 4 

"u is normal" means that u_1(t), u_2(t), . . . , u_G(t) are jointly
normally distributed.

Assumption 5 

"u is not autocorrelated" means that there is no correlation between 
the disturbance of one equation and previous values of itself. 



Assumption 6

"u is not correlated with z" means that no exogenous variable, in
whichever equation it appears, is correlated with any disturbance,
past, present, or future, of any equation in the model.

On these assumptions, the likelihood function of the sample is

L = (2π)^(−GS/2)(det J)^S (det [σ_gh])^(−S/2) exp{−½ Σ_s u(s)[σ^gh]u(s)'}   (5-6)

which should be compared with (2-2). The analogy is perfect. The
expression in the curly braces can also be written

−½ Σ_s Σ_g Σ_h σ^gh u_g(s)u_h(s)   (5-7)

Another way to write the likelihood function is

L = (2π)^(−GS/2)(det J)^S (det [σ_gh])^(−S/2) exp{−½ Σ_s [Ax(s)][σ^gh][Ax(s)]'}   (5-8)

which brings out the fact that L is a function (1) of all the unknown
parameters β_gh, γ_gh, σ_gh and (2) of all the observations x(s) (s = 1, 2,
. . . , S). The function's logarithmic form

S⁻¹ log L = −(G/2) log 2π + log det J − ½ log det [σ_gh]
            − ½S⁻¹ Σ_s [Ax(s)][σ^gh][Ax(s)]'   (5-9)

is easier to use.

5,4. Stochastic independence 

The seventh Simplifying Assumption, that [σ_gh] is a diagonal matrix, or

cov (u_g, u_h) = 0   for g ≠ h

is not obligatory, but it is easy to rationalize. It states that the
disturbance of one equation is not correlated with the disturbance in any
other equation of the model in the same time period, something quite
different from Assumption 6.
Recall that each random term is a gathering of errors of measure-
ment, errors of aggregation, omitted variables, omitted equations, and
errors of linear approximation. Assumption 7 states that either
(1) the gth equation and the hth equation are disturbed by different
random causes, or (2) if they are disturbed by the same causes, dif-
ferent "drawings" go into u_g(t) and u_h(t). This assumption is clearly
inapplicable in the following situations:

1. In year t, all or nearly all statistics were subject to larger than the 
usual errors, because of a cut in the budget of the Statistics Bureau. 

2. Errors of aggregation affect mainly national income (because of 
shifts in distribution), and national income enters several equations of 
the model. 

3. Omitted variables (one or more) are known to affect two (or more)
equations. For instance, weather affects the supply of watermelons,
cotton, and whale blubber. Now if the model contains equations for
watermelons and blubber, the inclusion of weather in the random term
does not hurt, because relatively independent drawings of weather
(one in the Southeast, one in the South Pacific) affect these two
industries. However, if watermelons and cotton are included in the
model, both of these are grown in the same belt, the weather affecting
them is one and the same, and Assumption 7 is violated.

Assumption 7 simplifies the computations (1) because it leaves fewer
covariances to estimate, (2) because det [σ_gh] becomes a simple product
Π_g σ_gg, and (3) because all the cross terms¹ in (5-7) drop out. This can
reduce computations by a factor of 2 or 3 for a model of as few as three
equations and by a much greater factor for larger systems.

Digression on Jacobians 

The likelihood function involves a term, det J, the
Jacobian of the functions u, say, with respect to the variables y;
we have disregarded det J until now, since we have taken it on
faith to be equal to 1. This is no longer true in a many-equation
model. Here J is a matrix of unknown parameters, the same βs,
in fact, that we are trying to estimate with the likelihood function.

The main ideas behind J are three: 

1. If you know the probability distribution of a variable u (or
several variables u_1, u_2, . . . , u_G), then you can find the proba-
bility distribution of a variable y related to u functionally (or of
several y's related functionally to the u's).

¹ Those for which g ≠ h.

2. If the u's and y's are equally numerous and if the functions
connecting the two sets are one-to-one, continuous, and with con-
tinuous first derivatives, then the matrix J of all partial deriva-
tives of the form ∂u/∂y will have an inverse.

3. If conditions 1 and 2 are satisfied, then we can calculate the
joint probability distribution q of the y's from the known joint
probability distribution p of the u's (omitting the subscript t) as

p(u_1, u_2, . . . , u_G) du_1 du_2 · · · du_G
    = det J · p(u_1, u_2, . . . , u_G) dy_1 dy_2 · · · dy_G
or   q(y_1, y_2, . . . , y_G) dy_1 dy_2 · · · dy_G
    = det J · p(u_1, u_2, . . . , u_G) dy_1 dy_2 · · · dy_G   (5-10)

I shall illustrate these three ideas by examples. 

Example 1 

Let u be a single variable whose probability distribution we
know to be as follows:

Value of u          −4      −3      1      3
Probability p(u)    p(−4)   p(−3)   p(1)   p(3)

Let y be related functionally to u as follows:

y(u) = u² − 4u + 3   (5-11)

As u takes on its four values, y takes on the corresponding
values y(−4) = 35, y(−3) = 24, y(1) = 0, y(3) = 0. Since we
know how often u is equal to −4, −3, 1, and 3, we can find how
often y is equal to 35, 24, and 0:

Value of y          35       24       0
Probability q(y)    p(−4)    p(−3)    p(1) + p(3)

Example 2

The same can be done with several y's and u's connected by an
appropriate set of functions, for instance,

y_1 = −u_1 + 3u_1² − u_3
y_2 = e^(−u_1) + log u_2   (5-12)

provided the probability distribution p(u_1, u_2, u_3) is known.

Relation (5-11) is not one-to-one, since, for every value of y, u
can have two values. Accordingly, in Example 1 the second
condition is violated, and the Jacobian is undefined. The same is
true for Example 2.

Whenever the functional relation between the u's and y's is
one-to-one, [∂u/∂y] and [∂y/∂u] are single-valued and their
determinants multiply up to the number 1.

Example 3 

y(u) = 3u + log u − 4

Though it is very hard to express u in terms of y, we know
that, since dy/du = 3 + 1/u = (3u + 1)/u, the Jacobian
J = du/dy = u/(3u + 1).

Example 4

y_1 = −u_1 + u_2
y_2 = e^(−u_1) + log u_2 + 5

Here we can compute det J from knowledge of

[∂y/∂u] = | −1          1     |
          | −e^(−u_1)   1/u_2 |

since it follows that det J = u_2/(u_2 e^(−u_1) − 1). Therefore, by
(5-10), the probability distribution of the y's is

q(y_1, y_2) dy_1 dy_2 = [u_2/(u_2 e^(−u_1) − 1)] p(u_1, u_2) dy_1 dy_2
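Example 4's determinant can be verified numerically. The sketch below (a modern aside; the test point is arbitrary) differentiates the mapping (u_1, u_2) → (y_1, y_2) by central differences, forms det [∂y/∂u], and compares its reciprocal with the formula det J = u_2/(u_2 e^(−u_1) − 1).

```python
# Finite-difference check of det J for Example 4 at an arbitrary point.
import math

def f(u1, u2):
    return (-u1 + u2, math.exp(-u1) + math.log(u2) + 5.0)

u1, u2, h = 0.5, 3.0, 1e-6

d11 = (f(u1 + h, u2)[0] - f(u1 - h, u2)[0]) / (2 * h)   # dy1/du1
d12 = (f(u1, u2 + h)[0] - f(u1, u2 - h)[0]) / (2 * h)   # dy1/du2
d21 = (f(u1 + h, u2)[1] - f(u1 - h, u2)[1]) / (2 * h)   # dy2/du1
d22 = (f(u1, u2 + h)[1] - f(u1, u2 - h)[1]) / (2 * h)   # dy2/du2

det_dy_du = d11 * d22 - d12 * d21
det_J_numeric = 1.0 / det_dy_du                  # det [du/dy] = 1/det [dy/du]
det_J_formula = u2 / (u2 * math.exp(-u1) - 1.0)  # the text's expression
print(det_J_numeric, det_J_formula)
```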

Now, what relevance does all this have to econometrics?
Very simple. Let y_1, y_2, . . . , y_G be endogenous variables, and
let u_1, u_2, . . . , u_G be the random errors attached to the struc-
tural equations. The model's G equations are explicit functional
relations between the y's and the u's, like (5-12). Directly, we
know nothing at all about the probability of this or that combina-
tion of y's. Nevertheless, (5-10) allows us to compute this proba-
bility, namely, in terms of J and the probability distribution
of the u's. It turns out that the right-hand side of (5-10) involves
only the parameters we seek, the observations we can make, and
the probability distribution p, which we have already specified
when we constructed the model.

If the structural equations are all linear, as in (5-1), the
matrix J of all partial derivatives of the form ∂u/∂y turns out
to be nothing but the matrix B itself:

J = | 1    β_12 · · · β_1G |
    | β_21 1    · · · β_2G |
    | . . . . . . . . . .  |
    | β_G1 β_G2 · · · 1    | = B

5.5. Interdependence of the estimates 

Now that we know that J = B, we can both find the values β̂, γ̂, σ̂
that maximize the likelihood function (5-9) and compute its actual
value. Actually we do not care how large L itself is.

Naturally, maximizing such a function by ordinary methods is a 
staggering job; we won't undertake it. In fact, nobody undertakes it 
by direct attack. We shall use (5-9) to answer the following question: 
In order to estimate this particular parameter or this particular 
equation, do we need to estimate all parameters? The answer, 
generally, is yes. 

Note first of all that the maximum likelihood method of estimating
B, Γ, and [σ_gh] differs from the naive least squares method quite radically,
because the least squares method does not involve the term log det B
at all. In other words, the least squares method, if applied to the
model one equation at a time, omits from account the matrix B; it
does not allow the parameters of one equation to influence the estima-
tion of the parameters of another; nor does it allow the covariances σ_gh


to influence in the least the parameter estimates of any equation that 
is being fitted. 

Finally, the least squares technique estimates the covariances σ_gg
one at a time without involving any other covariance. Contrariwise,
in maximum likelihood, the β̂s of one equation affect the β̂s
and γ̂s of another; the γ̂s of one equation affect the β̂s and γ̂s of another;
and one σ̂ affects another.

In a word, the sophisticated maximum likelihood method is very 
expensive from the point of view of computations and is probably 
more refined than the quality of the raw statistical data warrants. 
Econometric theory is like an exquisitely balanced French recipe, 
spelling out precisely with how many turns to mix the sauce, how 
many carats of spice to add, and for how many milliseconds to bake 
the mixture at exactly 474 degrees of temperature. But when the 
statistical cook turns to raw materials, he finds that hearts of cactus 
fruit are unavailable, so he substitutes chunks of cantaloupe; where the 
recipe calls for vermicelli he uses shredded wheat; and he substitutes 
green garment dye for curry, ping-pong balls for turtle's eggs, and, for 
Chalifougnac vintage 1883, a can of turpentine. 

Two courses of action are open to the econometrician who is reluc- 
tant to lavish refined computations on crude data: 

1. Use the refined maximum likelihood method, but reduce the 
burden of computation by making additional Simplifying Assumptions. 

2. Water down the maximum likelihood method to something more
pedestrian but not quite so naive as least squares. Limited informa-
tion, instrumental variables, and other techniques are available; they
are the subject of Chaps. 7, 8, and 9.

5.6. Recursive models 

If B is a triangular matrix,¹ the model is called recursive; and its
computation is lightened, because there are fewer βs to estimate and
because det B = 1.

The economic interpretation of a recursive model is the following. 
There is an economic variable in the system (say, the price of coffee 
beans) that is affected only by exogenous variables (like Brazilian 
weather) ; next, there is a second economic variable (say, the price of a 

¹ B is triangular if β_gh = 0 for all g < h.


cup of coffee) that is affected by exogenous variables (tax on coffee
beans) and by the one endogenous variable (price of coffee beans) just
mentioned. Next, there is a third economic variable (say, the number
of hours spent by employees on coffee breaks) that depends only on
exogenous variables (the amount of incoming gossip) and (one or both
of) the first two endogenous variables but no others; and so on.
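The computational saving is easy to see in a small numerical sketch (the off-diagonal entries below are made up): with the standardized triangular B of a recursive model, det B = 1 whatever the remaining βs are, so the Jacobian term log det J drops out of (5-9).

```python
# For a recursive model, the standardized B is triangular (beta_gh = 0
# for g < h) with ones on the diagonal, so det B = 1.  Entries illustrative.
B = [[1.0,  0.0, 0.0],
     [0.4,  1.0, 0.0],
     [0.7, -0.2, 1.0]]     # lower triangular, unit diagonal

def det3(M):
    """Determinant of a 3 x 3 matrix by cofactor expansion."""
    return (M[0][0] * (M[1][1] * M[2][2] - M[1][2] * M[2][1])
          - M[0][1] * (M[1][0] * M[2][2] - M[1][2] * M[2][0])
          + M[0][2] * (M[1][0] * M[2][1] - M[1][1] * M[2][0]))

print(det3(B))   # 1.0 regardless of the off-diagonal betas
```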


5.A In the recursive system

y_1 = γz_1 + u
y_2 = βy_1 + γ_1 z_1 + γ_2 z_2 + v

let the Simplifying Properties hold for u, v with respect to the exoge-
nous variables. Prove that, if β is estimated by naive least squares,
that is, if

β̂ = m_(y_2)(y_1·z_1,z_2) / m_(y_1)(y_1·z_1,z_2)

(the coefficient of y_1 in the least-squares regression of y_2 on y_1, z_1,
and z_2), then β̂ is biased.

5.B In the recursive model

x_t = βy_t + u_t
y_t = γx_(t−1) + v_t

show that β̂ and γ̂ are unbiased but that least squares applied to the
autoregressive equation obtained as a combination of the two equations
gives biased estimates.

Further readings 

The notation of Sec. 5.2 is worth learning because it is becoming standard 
among econometricians. It is expanded in Koopmans, chap. 2. 

Jacobians are illustrated by Klein, pp. 32-38. The mathematics of
Jacobians, with proofs, can be found in Richard Courant, Differential and
Integral Calculus, vol. 2, chap. 3 (New York: 1953), or in Wilfred Kaplan,
Advanced Calculus, pp. 90-100 (Reading, Massachusetts: 1952).

Klein, p. 81, gives a simple example of a recursive model. 



Identification

6.1. Introduction

Identification problems spring up almost everywhere in econometrics 
as soon as one departs from single-equation models. This chapter far 
from exhausts the subject. In particular, the next two topics, 
instrumental variables in Chap. 7 and limited information in Chap. 8, 
are intimately bound up with it. The identification problem will arise 
sporadically in later chapters. 

Though this chapter is self-contained, some familiarity with the 
subject is desirable. I know of no better elementary treatment than 
that of Tjalling C. Koopmans, "Identification Problems in Economic
Model Construction," chap. 2 in Hood. I have chosen to devote this 
chapter to a few topics which, in my opinion, either have not received 
convincing treatment or have not been put in pedagogic form. 

The main results of this chapter are the following: 

1. There are several definitions of identifiability. I show their
equivalence.
2. Lack or presence of identification may be due (a) to the model's 
a priori specification, (b) to the actual values of its unknown parame- 
ters, or (c) to the particular sample we happen to have drawn. 



3. There are ways to detect overidentification and underidentifica-
tion. These ways are not always foolproof. There are several ways
to remove over- or underidentification.

4. In spite of the superficial fact that they are defined in analogous
terms, underidentification and overidentification are qualitatively dif-
ferent properties: the former is nonstochastic, the latter stochastic;
the former can be removed (in special cases) by means of additional
restrictions, the latter is handled by better observations or longer
samples.
6.2. Completeness and nonsingularity

The following discussion applies to all kinds of models, linear or not, 
large or small, but it will be illustrated by this example:

y₁ + γ₁₁z₁ + γ₁₂z₂ + γ₁₃z₃ + γ₁₄z₄ = u₁

β₂₁y₁ + y₂ + β₂₃y₃ + γ₂₁z₁ + γ₂₂z₂ + γ₂₃z₃ = u₂   (6-1)

β₃₁y₁ + y₃ + γ₃₁z₁ + γ₃₂z₂ = u₃

This model describes an economic mechanism that works somewhat 
like this: 

1. The parameters β and γ are fixed constants.

2. In each time period, someone supplies outside information about 
the exogenous variables z. 

3. In each time period, someone goes to a preassigned table of 
random numbers, and, using a prescribed procedure, reads off some 
numbers u₁, u₂, u₃.

4. All this is fed into (6-1). 

5. Values for the endogenous variables, y₁, y₂, y₃, are generated in
accordance with the resulting system. 

The last step succeeds if and only if the linear equations resulting from 
step 4 are independent. Otherwise there is an infinity of compatible 
triplets (y₁,y₂,y₃). The model is complete if it can be solved uniquely
for (y₁,y₂,y₃); otherwise it is incomplete. To generate a unique
triplet it is necessary and sufficient that the matrix B be nonsingular,
meaning that no row of it is a linear combination of other rows. 
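The five steps can be sketched numerically (coefficient values hypothetical, but with the zero pattern of (6-1)):

```python
import numpy as np

# Rows are the three sectors of (6-1); columns of B are y1, y2, y3.
B = np.array([[1.0, 0.0,  0.0],
              [0.5, 1.0, -0.3],
              [0.2, 0.0,  1.0]])
Gamma = np.array([[0.7, -0.4,  0.1, 0.9],
                  [0.3,  0.2, -0.6, 0.0],
                  [-0.1, 0.8,  0.0, 0.0]])
z = np.array([1.0, 2.0, 0.5, -1.0])    # step 2: outside information
u = np.array([0.11, -0.05, 0.07])      # step 3: table of random numbers

# Step 5 succeeds because B is nonsingular: B y = u - Gamma z then has
# exactly one solution (y1, y2, y3).
assert np.linalg.det(B) != 0
y = np.linalg.solve(B, u - Gamma @ z)
assert np.allclose(B @ y + Gamma @ z, u)
```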

The economic interpretation of singularity and nonsingularity 


is very simple. Each equation in (6-1) represents the behavior 
of a sector of the economy, say, producers, consumers, bankers, 
buyers, sellers, or middlemen. These sectors respond to exogenous 
stimuli z and economic stimuli y. They may respond to exogenous 
stimuli in any way whatsoever. In particular, it is quite permis- 
sible for them to respond in the same way to all exogenous stimuli 
(γ₁₁ = γ₂₁ = γ₃₁, γ₁₂ = γ₂₂ = γ₃₂, etc.). But, if the matrix is to be
nonsingular, they should respond in different ways to the endogenous 
stimuli. No sector may have the same parameters as another; no 
sector's responses may be the average of two other sectors' responses. 
No sector may be a weighted average of any other sectors, as far as
economic stimuli are concerned.

To illustrate singularity, consider a simple economy which consists
of three families responding to three economic stimuli but such that the
third family makes an average response. Then B is singular, and the
model containing the three families is incomplete. For nonsingularity
the sectors must be sufficiently unlike each other. In fact this is the
definition of sectors: that they are economically different from one
another.

6. A Prove the following theorems by using the common sense of 
the five steps of the above discussion: "If Assumption 7 is made, 
then B is nonsingular," and "If B is singular, Assumption 7 cannot 
hold." These two statements can be reworded: "An econometric 
model is complete if and only if its sectors are stochastically inde-
pendent." Appendix E proves this mathematically, but what is
wanted in this exercise is an "economic" proof. 

6.3. The reduced form 

Every complete linear model By + Γz = u can be reduced to
y = Πz + v. These two expressions are called the original form and
the reduced form. If it is complete, the original model (6-1) can be
reduced to

y₁ = π₁₁z₁ + π₁₂z₂ + π₁₃z₃ + π₁₄z₄ + v₁

y₂ = π₂₁z₁ + π₂₂z₂ + π₂₃z₃ + π₂₄z₄ + v₂   (6-2)

y₃ = π₃₁z₁ + π₃₂z₂ + π₃₃z₃ + π₃₄z₄ + v₃


Some obvious properties of (6-2) are worth pointing out: Its random
disturbances v₁, v₂, v₃ are linear combinations of the original random
disturbances and share their properties. However, the v's have different
covariances from the u's. In particular, the v's are interdependent
even if the u's were stochastically independent. (We seldom have to
worry about the precise relation among the u's and v's.) Unlike the
typical original form, each equation of the reduced form contains all 
the exogenous variables of the model. 

Each equation of the reduced form constitutes a model that satisfies
the Six Simplifying Assumptions of Chap. 1 and, therefore, may validly
be estimated by least squares; these estimates are called π̂'s. If it is
possible to work back from the π̂'s to estimate unambiguously the
coefficients β, γ of the original form, we shall call such estimates β̂, γ̂
and say that (6-1) is exactly identified. Finally, the coefficients of the
two forms (6-1), (6-2) are connected as follows:

−γ₁₁ = π₁₁   −γ₂₁ = β₂₁π₁₁ + π₂₁ + β₂₃π₃₁   −γ₃₁ = β₃₁π₁₁ + π₃₁
−γ₁₂ = π₁₂   −γ₂₂ = β₂₁π₁₂ + π₂₂ + β₂₃π₃₂   −γ₃₂ = β₃₁π₁₂ + π₃₂   (6-3)
−γ₁₃ = π₁₃   −γ₂₃ = β₂₁π₁₃ + π₂₃ + β₂₃π₃₃   0 = β₃₁π₁₃ + π₃₃
−γ₁₄ = π₁₄   0 = β₂₁π₁₄ + π₂₄ + β₂₃π₃₄   0 = β₃₁π₁₄ + π₃₄


It is possible, but messy, to solve for π₁₁, . . . , π₃₄ in terms of the
βs and γs. The important fact is that, in general, all πs are a priori
nonzero in the reduced form, even if many of the βs and γs are a priori
zero in the original form.
Relations (6-3) can be written much more compactly:

−Γ = BΠ   (6-4)
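Relation (6-4) also tells us how to compute the reduced form: Π = −B⁻¹Γ and v = B⁻¹u. A sketch with hypothetical coefficients having the zero pattern of (6-1):

```python
import numpy as np

B = np.array([[1.0, 0.0,  0.0],
              [0.5, 1.0, -0.3],
              [0.2, 0.0,  1.0]])
Gamma = np.array([[0.7, -0.4,  0.1, 0.9],
                  [0.3,  0.2, -0.6, 0.0],
                  [-0.1, 0.8,  0.0, 0.0]])
Binv = np.linalg.inv(B)
Pi = -Binv @ Gamma

# -Gamma = B Pi, i.e. relation (6-4):
assert np.allclose(B @ Pi, -Gamma)
# Although many gammas are zero a priori, every pi here is nonzero:
assert np.all(Pi != 0)

# Even if the u's are independent (diagonal covariance), the v's,
# being v = Binv u, are interdependent:
Sigma_u = np.diag([1.0, 2.0, 0.5])
Sigma_v = Binv @ Sigma_u @ Binv.T
assert not np.allclose(Sigma_v, np.diag(np.diag(Sigma_v)))
```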

6.4. Over- and underdeterminacy 

As a preview for the rest of this chapter, imagine that (6-1) is 
complete. If so, its reduced form (6-2) exists and can be estimated by 
least squares. Let the estimates be π̂₁₁, . . . , π̂₃₄.

Now consider the leftmost column of equations in (6-3). Evidently 
the γs can be computed right away from the π̂s, uniquely and unam-
biguously. We say, then, that γ₁₁, γ₁₂, γ₁₃, γ₁₄ are exactly identified.

Consider next the last two equations of (6-3); they give rise to two
estimates of β₃₁, namely, −π̂₃₃/π̂₁₃ and −π̂₃₄/π̂₁₄, which in general are
quite different, no matter how ideal the sample. When this happens
to parameters, we say that they are (or the equation that generates
them is) overidentified; accordingly, system (6-3) overdetermines β₃₁.

Consider now the middle column of (6-3). Its four equations
underdetermine the five unknowns β₂₁, β₂₃, γ₂₁, γ₂₂, γ₂₃. An equation
to which such parameters belong is underidentified.

Obviously, then, the identification problem has something to do
with the number of equations and unknowns in the system −Γ = BΠ.
The Counting Rules of Sec. 6.7 will show this more precisely.
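The same arithmetic can be previewed numerically (hypothetical coefficients with the zero pattern of (6-1)): with the true Π the two formulas for β₃₁ in (6-3) agree exactly, but a sampled Π̂ almost never satisfies both at once.

```python
import numpy as np

rng = np.random.default_rng(1)
B = np.array([[1.0, 0.0,  0.0],
              [0.5, 1.0, -0.3],
              [0.2, 0.0,  1.0]])
Gamma = np.array([[0.7, -0.4,  0.1, 0.9],
                  [0.3,  0.2, -0.6, 0.0],
                  [-0.1, 0.8,  0.0, 0.0]])
Pi = -np.linalg.inv(B) @ Gamma

# Exact identification of gamma_11..gamma_14 (leftmost column of (6-3)):
assert np.allclose(-Pi[0], Gamma[0])
# Overdetermination of beta_31 (last two equations of (6-3)); both
# formulas return the true value B[2, 0] when Pi is exact:
assert np.isclose(-Pi[2, 2] / Pi[0, 2], B[2, 0])
assert np.isclose(-Pi[2, 3] / Pi[0, 3], B[2, 0])

# Perturb the pi's the way sampling error would: the two estimates of
# beta_31 now disagree.
Pi_hat = Pi + 0.05 * rng.normal(size=Pi.shape)
b1 = -Pi_hat[2, 2] / Pi_hat[0, 2]
b2 = -Pi_hat[2, 3] / Pi_hat[0, 3]
assert not np.isclose(b1, b2)
```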

6.5. Bogus structural equations 

Consider the supply-demand model

SS (Supply)   y₁ + β₁₂y₂ = u₁

DD (Demand)   β₂₁y₁ + y₂ = u₂   (6-5)

where y₁ represents price and y₂ represents quantity; linear combina-
tions of the true supply and demand are called bogus relations and are
branded with the superscript °. A bogus relation may parade either
as supply or as demand,

SS° = j(SS) + k(DD)   DD° = m(SS) + n(DD)

where j, k, m, n are unknown numbers, but suitable to make the
standardized coefficients β°₁₁, β°₂₂ of the bogus relations equal to 1.

The bogus coefficients are connected with the true coefficients as
follows:

β°₁₁ = j + kβ₂₁ = 1   β°₁₂ = jβ₁₂ + k

β°₂₁ = m + nβ₂₁   β°₂₂ = mβ₁₂ + n = 1

The bogus supply contains a random term

u°₁ = ju₁ + ku₂

and the bogus demand contains an analogous term

u°₂ = mu₁ + nu₂
Later on we shall use the following relations between the covariances
of the bogus and the true disturbances:

var u°₁ = j² var u₁ + 2jk cov (u₁,u₂) + k² var u₂

var u°₂ = m² var u₁ + 2mn cov (u₁,u₂) + n² var u₂   (6-6)

cov (u°₁,u°₂) = jm var u₁ + (jn + mk) cov (u₁,u₂) + kn var u₂
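Relations (6-6) are simply the covariance matrix of the linear combination (u₁°, u₂°) = A(u₁, u₂) with A = [[j, k], [m, n]]; a quick numerical check with hypothetical values of j, k, m, n:

```python
import numpy as np

j, k, m, n = 0.6, 0.4, 0.3, 0.7
var_u1, cov_u12, var_u2 = 3.0, 0.0, 10.0
A = np.array([[j, k], [m, n]])
Sigma = np.array([[var_u1, cov_u12], [cov_u12, var_u2]])
Sigma_bogus = A @ Sigma @ A.T   # covariance matrix of (u1°, u2°)

# Entry-by-entry match with (6-6):
assert np.isclose(Sigma_bogus[0, 0],
                  j**2 * var_u1 + 2*j*k*cov_u12 + k**2 * var_u2)
assert np.isclose(Sigma_bogus[1, 1],
                  m**2 * var_u1 + 2*m*n*cov_u12 + n**2 * var_u2)
assert np.isclose(Sigma_bogus[0, 1],
                  j*m * var_u1 + (j*n + m*k)*cov_u12 + k*n * var_u2)
```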

6.6. Three definitions of exact identification 

The discussion that follows is meant to apply to linear models only.
Some results can be extended to other types of models (but not in this
book).

A model or an equation in it may be either (exactly) identified, or
underidentified, or overidentified. Setting aside for the moment the
last two cases, here are three alternative definitions of exact identifica-
tion, one in terms of the statistical appearance of the model, one in
terms of maxima of the likelihood function L, and one in terms of the
probability distribution of the endogenous variables.

Definition 1. A model is identified if its structural equations "look 
different" from the statistical point of view. An equation looks 
different if linear combinations of the other equations in the system 
cannot produce an equation involving exactly the same variables as the 
equation in question. 

Thus the supply-demand model (6-5) is not exactly identified, 
because both equations contain the same variables, price and quantity. 

In the model

SS   y₁ + β₁₂y₂ + γ₁₁z₁ = u₁

DD   β₂₁y₁ + y₂ = u₂   (6-7)

where z₁ represents rainfall, a linear combination of SS and DD
contains the same variables as SS itself. Not so for DD, because
every nontrivial linear combination introduces rainfall into the demand
equation. In this model the demand equation is identified, but the
supply is not exactly identified. In such cases the model is not exactly
identified.
Definition 2. A model is identified if the likelihood function L(A)
has a unique maximum at a "point" A = A⁰. This means that, if you
substitute the values α⁰ in L, L is maximal; at any other point L is
definitely smaller. Similarly, an equation is exactly identified if the
likelihood function L becomes smaller when you replace the set α°_g of
that equation's parameters by any other set α_g. This way of looking
at the matter is presented in detail later on, in Sec. 6.12.

Definition 3. Anything (a model, an equation, a parameter) is
called exactly identified if it can be determined from knowledge of the
conditional distribution of the endogenous variables, given the exoge-
nous. This is to say, it is identified if, given a sample that was large
enough and rich enough, you could determine the parameters in
question. We know that, no matter how large the sample or how
rich, we could never disentangle the two equations of (6-5).

All three definitions appear to say that exact identification is not a
stochastic property, for it does not seem to depend on the samples we
may chance to draw. We shall return to this question later on.

One must be very accurate and careful about the terminology.
Over-, under-, and exact identification are exhaustive and mutually
exclusive cases. Identified means "either exactly or overidentified."
Not identified means "underidentified."

Underidentification occurs when: 

By linear combinations of the equations one can obtain a bogus equa- 
tion that looks statistically like some true equation (Definition 1). 

The likelihood function has a maximum maximorum at two or more 
points of the parameter space (Definition 2). 

Knowledge of the conditional distribution of the endogenous variables, 
given the exogenous, does not determine all the parameters of the
model (Definition 3). 

There are three principal ways to avert (or at least to detect) 
absence of exact identification: (1) constraints on the a priori values of 
the parameters; (2) constraints on the estimates of the parameters; 
(3) constraints on the stochastic assumptions of the model. 

6.7. A priori constraints on the parameters 

Two new symbols will speed up the discussion considerably. Sup- 
pose we are discussing the third equation of a model. A single asterisk 
will denote the variables present in the third equation, a double asterisk, 
those absent from the third equation. Asterisks can be attached to 


variables, to their parameters, or to vectors of such variables and 
parameters. The asterisk notation has now become standard in 
econometric literature, and Appendix F gives a detailed account of it. 

The commonest a priori restrictions on A are (1) zero restrictions,
like γ₂₄ = 0; (2) parameter equalities in the same equation, for
example, γ₂₁ = γ₂₂; (3) other equations involving parameters of
several equations a priori.

These cases have economic counterparts, which I proceed to
discuss.
Zero restrictions 

Zero restrictions are common and handy. A zero restriction says 
that, for all we know, such and such a variable is irrelevant to the
behavior of a given sector. If nothing but zero restrictions are 
contemplated, then we have a handy counting rule (Counting Rule 1) 
for telling whether an equation is identified. 

If an equation of a model contains all the variables of the model, it is 
underidentified, because linear combinations of all the equations look 
statistically just like it. To avoid this underidentification, the follow- 
ing two conditions are necessary: 

1. That some variables (call them x**) be absent from this equation.

2. That the variables (call them x*) present in the equation in
question, whenever they appear in another equation, be mixed with at
least one x**.

In (6-1) the first equation is identified, because any intermixture of
the second equation brings in variables y₂** and y₃** (double-starred
from the point of view of the first equation), and intermixture of the
third equation brings in y₃**, which is absent from the first equation.
In (6-1) the second equation is underidentified, because the third
equation can be merged into it without bringing in any variable that
is not already in the pure, uncontaminated second equation. Finally,
the third equation is identified, because an intermixture of the first
equation introduces z₃** and z₄**, and intermixture of the second
equation introduces y₂** and z₃** (the double stars are now from the
point of view of the third equation).

This example shows that underidentification can be detected by
checking whether given strategic parameters in the model are specified 
a priori to be zero or nonzero. This justifies the following statement: 


Counting Rule 1. For an equation to be exactly identified it is 
necessary (but not sufficient) that the number of variables absent 
from it be one less than the number of sectors. 

Thus, if G* and H* are, respectively, the number of endogenous and
exogenous variables present in the gth equation, then for the gth
equation to be identified it is necessary (but not sufficient) that
G + H − (G* + H*) = G − 1, or that H − H* = G* − 1.

Parameter equalities in the same equation 

Another quite common a priori restriction is to set two or more 
parameters of a given equation a priori equal. For instance, let us 
interpret (6-1) as a model of bank behavior, where z₁ represents
balances of banks at the Federal Reserve and z₂ represents balances of
banks at other banks. It is conceivable that a commercial bank may 
conduct its loan policy by looking at its total balances and not at 
whether they are held at the Federal Reserve or at another bank. 
The restriction would be expressed γ₂₁ = γ₂₂. On the other hand,
some other sector, say, foreign banks, may treat the two kinds of
balances differently, γ₃₂ ≠ γ₃₁. Under these conditions, if the third
equation is intermixed with the second equation, the result cannot 
masquerade as the second equation, because the bogus second equation 
would have different coefficients for z₁ and z₂, contrary to the a priori
assumption that the response to all balances (Federal Reserve and 
other) is identical. 

Linear equations connecting the parameters of different equations 

Suppose that a model contains a production function and an equa-
tion showing the distribution of national income by factor shares.
Then the coefficient of the share of labor is a priori equal to the labor 
coefficient of the production function, on the grounds of the marginal 
productivity theory of wages. 

Collectively, all the linear restrictions on A discussed so far can be
capsuled into Counting Rule 2. Let A** be what is left of A if we throw
out the columns corresponding to the variables present in the gth
equation.

Counting Rule 2. For the gth equation to be exactly identified it is
necessary and sufficient that the matrix A** have rank G − 1.
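Both Counting Rules can be applied mechanically to A = [B Γ] before any computation; here for model (6-1), with the hypothetical coefficient values used in earlier sketches (a variable is taken as present when its coefficient is a priori nonzero):

```python
import numpy as np

B = np.array([[1.0, 0.0,  0.0],
              [0.5, 1.0, -0.3],
              [0.2, 0.0,  1.0]])
Gamma = np.array([[0.7, -0.4,  0.1, 0.9],
                  [0.3,  0.2, -0.6, 0.0],
                  [-0.1, 0.8,  0.0, 0.0]])
A = np.hstack([B, Gamma])
G = B.shape[0]

absences, ranks = [], []
for g in range(G):
    absent = A[g] == 0                     # the double-starred variables
    absences.append(int(absent.sum()))     # Counting Rule 1: compare to G - 1
    ranks.append(int(np.linalg.matrix_rank(A[:, absent])))  # Counting Rule 2

# Equation 2 has only one absent variable (z4), fewer than G - 1 = 2,
# and its A** has rank 1: underidentified.  Equations 1 and 3 pass the
# rank test; equation 3 has 3 > G - 1 absences, the mark of
# overidentification noted in Sec. 6.4.
assert absences == [2, 1, 3]
assert ranks == [2, 1, 2]
```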


These tests and counting rules can (and should) be applied before 
you start computations. 

There are no convenient counting rules for nonlinear restrictions on 
the parameters. 

Inequalities such as α > 0 or α < 0 do not help to remove under-
identification. For instance, knowledge that demand is downward-
sloping and that supply is upward-sloping does not help to identify
the model (6-5).

6.8. Constraints on parameter estimates 

Consider again the supply-demand model (6-7). It states a priori 
that rainfall influences supply and not demand; and this restriction 
identifies the demand equation (but not the supply). Now imagine 
that you draw an unlucky sample made up of cases where the other 
random elements u₁ have annihilated the theoretical effect of rainfall.
You will get γ̂₁₁ (for this sample) equal to zero. The sample has
behaved as if rainfall did not influence supply, i.e., as if the model were 
reduced to (6-5), where the demand was statistically indistinguishable 
from the supply. 

The moral of this is: If you are not a priori certain that supply is 
influenced by rainfall (not only theoretically but also in the sample 
period) then do not proceed with the estimation of demand. If you 
fear that rainfall fails to affect supply (whether in the sample or 
generally), then to estimate the demand introduce in the supply 
function another variable z₂ (say, last year's price of a competing crop)
that you are certain influences (if ever so little) this year's supply both 
theoretically and in the sample period. The new model then is 

SS   y₁ + β₁₂y₂ + γ₁₁z₁ + γ₁₂z₂ = u₁

DD   β₂₁y₁ + y₂ = u₂   (6-8)

and so last year's price takes on the burden that rainfall is supposed to 
carry in making the demand identifiable. 

A very neat extension of Counting Rule 2 covers all these require-
ments: For exact identification, the ranks of A** and Â** must equal
G − 1.

We can show this in a third way.¹ If, in the original model (6-7),
¹ With acknowledgments to T. C. Koopmans, in Hood, pp. 31-32.


γ₁₁ is truly nonzero, then it is impossible to construct a bogus demand
equation without detecting it. Take as the bogus demand

DD° = ½DD + ½SS

Then the bogus random term of demand is

u°₂ = ½u₁ + ½u₂ − ½γ₁₁z₁

Then cov (u°₂,z₁) is not zero, and will show up in estimation, unless
the sample is the unlucky one in which rainfall is neutralized (γ̂₁₁ ≈ 0)
by the random factor. If, upon completing the estimation, we
discover that m_z₁·û₂ is quite different from zero, then we can detect
underidentification but we cannot remove it. On the other hand, the
discovery that m_z₁·û₂ is nearly zero is no guarantee that we have
identified demand if there is a strong reason to suspect that supply is
unaffected by rainfall.

6.9. Constraints on the stochastic assumptions 

Let the random terms of (6-5) satisfy Simplifying Assumptions
1 to 7, so that cov (u₁,u₂) = 0. Will this help to identify the supply
SS? Sometimes. Suppose that we knew beforehand that var u₁,
cov (u₁,u₂), var u₂ were of the orders of magnitude 3, 0, 10, respec-
tively. The "deception" can be detected from (6-6) if Σ(û°₁)², which
is the estimate of var u°₁, is very different from 3. This can have
happened by chance in the sample used, but it becomes more and
more unlikely the more Σ(û°₁)² differs from 3. On the other hand,
the bogus variances and covariances may have nothing peculiar about
them; indeed they may equal 3, 0, and 10, respectively, because of a
special set of values that j, k, m, and n have happened to take on.
Therefore, in general, there is no guarantee that SS° will look statis-
tically different from SS, even if we have complete knowledge of the
underlying covariances of the random term.

Another way to impose identification on a model is to say something 
specific about the variances of the random terms. This was done by 


Schultz in some early studies of agricultural markets. 1 In some of 
Schultz's work, both supply and demand are functions of the same two 
endogenous variables (price and quantity) and of random shocks. 
However, supply is more random than demand. Then the scatter of 
observed points will be more in agreement with the demand than with 
the supply function. Ambiguity is not eliminated entirely, but it is 
reduced as the randomness of supply increases relative to the random- 
ness of demand. In the notation of (6-6), the restriction takes the
form var u₁ = q var u₂; and identification improves with increase in q.

More complex restriction of this kind could also help. 

To summarize the results of Secs. 6.7 to 6.9:

1. Identification can be checked before computing by use of the 
Counting Rules as applied to A. 

2. If you fear that an equation is underidentified because you are not
sure whether a given variable x reacts significantly, estimate the equa-
tion anyhow and then check whether the covariance of x with the
residual û_g is near zero; if not, you may have identified the gth equation.
If m_x·û_g is near zero, you have not identified your equation. If the
numerically largest determinant of rank G − 1 from A** is close to zero,
x probably did not play a significant role.

3. There are tests that help detect underidentification. 

4. It is sometimes possible to remove underidentification. 

6.10. Identifiable parameters in an underidentified equation
When an equation is underidentified, is it perhaps possible to 
identify one or more of its parameters, though not all? For instance, 
what about the identifiability of γ₁₁ in (6-7)? Intuition says that γ₁₁
cannot be adulterated by linear combinations of DD, since z₁ occurs
only in the supply SS. Intuition is wrong if it concludes that this
fact makes γ₁₁ identifiable. Applying (6-4), we have

−γ₁₁ = π₁₁ + β₁₂π₂₁

The πs can be computed from the reduced form

1 Henry Schultz, The Theory and Measurement of Demand, pp. 72-81 (Univer- 
sity of Chicago Press, Chicago: 1938). 


y₁ = π₁₁z₁ + v₁

y₂ = π₂₁z₁ + v₂

but β₁₂ is and remains unknown and, therefore, so does γ₁₁.

So, contrary to intuition, the fact that a given variable enters one
equation of a model and no others does not make its coefficient identi-
fiable. Underidentification is a disease affecting all parameters of
the affected equation. For, if the gth equation is unidentified, this
means that there are fewer equations than unknowns in the gth row of
formula (6-4). All coefficients of the gth equation enter (6-4) sym-
metrically, and so none can have a privileged position over the others.

Let us now ask whether we can identify, in an otherwise unidenti- 
fiable equation, the ratio of two unidentifiable coefficients. In special 
cases it may be both important and sufficient to know the relative 
rather than the absolute impact of two kinds of variables. Let us 
consider (6-8) as a model of the supply and demand for loans, where 
y₁ is quantity of loans, y₂ is interest rate, z₁ is balances at the Federal
Reserve, and z₂ is balances at foreign banks. We are curious to know
whether the two kinds of bank balances differ in their effects on the
loan policy of a commercial bank. Is it possible to identify γ₁₁/γ₁₂?
No, because (6-4) applied to this model yields

−γ₁₁ = π₁₁ + β₁₂π₂₁
−γ₁₂ = π₁₂ + β₁₂π₂₂

which cannot be solved for γ₁₁/γ₁₂ so long as β₁₂ is unidentified. The
most we can get is the relation

(γ₁₁ + π₁₁)/π₂₁ = (γ₁₂ + π₁₂)/π₂₂

which is a straight line in the γ₁₁,γ₁₂ space, giving an infinity of pairs
(γ₁₁, γ₁₂).


6.B Derive explicitly the equations −Γ = BΠ for (6-8).

6.C In the above exercise, compute the two values of β₂₁ in terms
of the coefficients of the reduced form. Under what arithmetical
conditions would they be identical? Interpret this in economic terms.


6.11. Source of ambiguity in overidentified models 

Let us return to (6-8), rewriting it for convenience

SS   q + β₁₂p + γ₁₁r + γ₁₂c = u₁

DD   β₂₁q + p = u₂   (6-9)

where q = quantity, p = price, r = rainfall, c = last year's price of a
competing crop. Supply is underidentified, and demand is over-
identified. For the latter we get from the reduced form two incom-
patible estimates of the single unknown β₂₁:

β̂′₂₁ = −π̂₂₁/π̂₁₁   β̂″₂₁ = −π̂₂₂/π̂₁₂

But why should the reduced form, if estimated by least squares, give
two values for β₂₁, the price elasticity of demand? The answer is in
terms of the wobblings of the supply function. In (6-9), supply
wobbles in response to random shocks u₁ and to two unrelated exoge-
nous variables, this year's rainfall r and last year's price c of a compet-
ing crop. In Fig. 10a I have drawn some supply curves corresponding
to different amounts of rainfall (+1, −1) for a fixed value of c (= 0).
Observable points fall in the parallelogram ABCD. On the other
hand, in Fig. 10b the variations in supply come not from rainfall (which
is held constant at 0) but from last year's price only. Observations
fall in the parallelogram EFGH. The first estimate of β₂₁

−β̂′₂₁ = π̂₂₁/π̂₁₁ = m(p,c)·(r,c)/m(q,c)·(r,c)   (6-10)

corresponds to the broken line in Fig. 10a, because π̂₂₁/π̂₁₁ correlates
price and quantity reactions as they result from variations in rainfall
only. The other estimate

−β̂″₂₁ = π̂₂₂/π̂₁₂ = m(r,p)·(r,c)/m(r,q)·(r,c)   (6-11)

corresponds to the broken line in Fig. 10b, because it correlates p to q
as a result of variations in last year's price alone.¹ The sample must
be very peculiar indeed that yields equal estimates β̂′₂₁ and β̂″₂₁.

1 In expressions like (6-10) and (6-11), the heuristic device of canceling the 
"factors" c and r in numerator and denominator gives a correct interpretation 
of what is being correlated, provided that these "factors" appear on both sides 
of both dots. 



The explanation, then, is at bottom simple: When demand is
overidentified, this means that both rainfall r and lagged price c make
the supply shift up and down, and trace the demand relationship for us.
The original form of the model shows this. The reduced form, how-
ever, does not allow us to trace the demand uniquely, as the result of
the combined effect of rainfall and lagged price. Rather, the reduced
form gives us a choice of estimating the slope of the demand equation
either as a result of rainfall-induced variations in supply or as a result
of lagged-price-induced variations in supply. Essentially, then, either
alternative leaves out some crucial consideration, namely, the fact that






Fig. 10. Ambiguity in an overidentified equation. 

the omitted variable (lagged price and rainfall, respectively) also 
affects the price and quantity combinations that the sample shows. 
To show that (6-10) is a biased estimate, write p = u₂ − β₂₁q. Then

m(p,c)·(r,c) = m(u₂,c)·(r,c) − β₂₁m(q,c)·(r,c)

and so

−β̂′₂₁ = −β₂₁ + m(u₂,c)·(r,c)/m(q,c)·(r,c)

The expected value of the bias term is not zero. This is easily seen
from (6-9). Let r, c, and u₁ be fixed, and let u₂ take on a set of con-
jugate values +u₂ and −u₂ ≠ 0. Then, in (6-9), q necessarily takes on
two different values q′ and q″, and thus the above denominator changes
as u₂ takes on its conjugate values. Therefore, m(+u₂,c)·(r,c)/m(q′,c)·(r,c)
and m(−u₂,c)·(r,c)/m(q″,c)·(r,c) do not add up to zero. To show that
β̂′₂₁ is a consistent estimate, consider that plim m(u₂,c)·(r,c) = 0 but
plim m(q,c)·(r,c) ≠ 0.
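A Monte Carlo sketch of (6-9), with hypothetical parameter values, shows both facts at once: the two estimates of β₂₁ disagree in any ordinary finite sample, yet both close in on the true value as the sample grows (biased but consistent):

```python
import numpy as np

rng = np.random.default_rng(2)
b12, b21, g11, g12 = -0.8, 0.5, 1.0, 0.7   # hypothetical true coefficients

def two_estimates(T):
    # Generate (q, p) from (6-9), then estimate the reduced form by
    # least squares and work back to beta_21 two different ways.
    r = rng.normal(size=T)
    c = rng.normal(size=T)
    u1 = rng.normal(size=T)
    u2 = rng.normal(size=T)
    q = (u1 - b12 * u2 - g11 * r - g12 * c) / (1 - b12 * b21)
    p = u2 - b21 * q
    Z = np.column_stack([r, c])
    pi_q = np.linalg.lstsq(Z, q, rcond=None)[0]   # (pi11-hat, pi12-hat)
    pi_p = np.linalg.lstsq(Z, p, rcond=None)[0]   # (pi21-hat, pi22-hat)
    return -pi_p[0] / pi_q[0], -pi_p[1] / pi_q[1]

b_small = two_estimates(100)       # beta'-hat and beta''-hat disagree
b_large = two_estimates(200_000)   # both converge to b21 = 0.5
assert not np.isclose(b_small[0], b_small[1])
assert abs(b_large[0] - b21) < 0.05 and abs(b_large[1] - b21) < 0.05
```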


6.D If it turns out that β̂′₂₁ = β̂″₂₁, the sample moments must
satisfy either the equation m_rr m_cc = m_rc m_rc or the equation
m_pc m_qr = m_pr m_qc. The first of these declares that rainfall and last
year's price are perfectly correlated in the sample. Interpret the
second one. Hint: Use the fact that p = u₂ − β₂₁q.

6.E If least squares are applied to the reduced form, obtaining
π̂'s, prove the following: (1) that all parameters β̂, γ̂ that can be esti-
mated by working back from the reduced form are consistent; (2) that
all γ̂s that can be estimated (whether uniquely or ambiguously) are
in general biased; (3) that all β̂s that can be estimated (whether
uniquely or ambiguously) are in general biased.

In the following exercises, p (price) and q (quantity) are the endoge- 
nous variables. The exogenous variables are i (interest rate), f (liquid
funds), and r (rainfall).

6.F Show that in

SS   q + β₁₂p = u   (exactly identified)

DD   β₂₁q + p + γ₂₁i = v   (underidentified)

only β₁₂ can be estimated unambiguously.
6.G From the model

SS   q + β₁₂p + γ₁₁r = u   (overidentified)

DD   β₂₁q + p + γ₂₁i + γ₂₂f = v   (exactly identified)

the reduced form leads to the following estimates of β₁₂:

β̂′₁₂ = −π̂₁₁/π̂₂₁   β̂″₁₂ = −π̂₁₂/π̂₂₂

where π̂₁₁ = m(q,f,r)·(i,f,r)/m(i,f,r)·(i,f,r) and π̂₂₁ = m(p,f,r)·(i,f,r)/m(i,f,r)·(i,f,r).
Show that these estimates are biased and consistent.

6.H In Exercise 6.G, find the bias (if any) for γ̂₁₁, γ̂₂₁, γ̂₂₂.

6.12. Identification and the parameter space 

The likelihood function L(Â) may or may not have a unique highest
maximum as a function of the parameter estimates Â. If it does, the
model is (exactly or over-) identified.

Fig. 11. Maxima of the likelihood function. a. Underidentified; b. exactly
identified; c. overidentified.



Along the axes, labeled α and β in Fig. 11, let me represent the
parameter space. Usually this space has more dimensions, but I
cannot picture these on flat paper.

Underidentification is pictured in Fig. 11a. Here the mountain has
either a flat top T or a ridge RR′, or both. Its elevation is highest in
many places rather than at a single place; i.e., there are many local
maxima. This means that several values of α and β are candidates for
the role of estimates of the true α and β. In the picture these can-
didates lie in the cobra-like area PP′Q that creeps on the floor.

When the system is (exactly or over-) identified, nothing of this sort
happens. The mountain has a single highest point. If the system is
exactly identified, this fact is the end of the story, and Fig. 11b applies.
When the system is overidentified, then Fig. 11c applies. The moun-
tain in Fig. 11c is the same as in Fig. 11b, but we have several conflicting
ways to look for the top. One estimating procedure allows us to look
for the highest point of the mountain along, say, the 38th parallel;
another equally admissible procedure tells us to look for it not
along the 38th parallel but along the boundary XY between area
I and area II. Accordingly, we get P′ and P″, two estimates of P that
correspond to β̂′₂₁ and β̂″₂₁ of equations (6-10) and (6-11).

6.13. Over- and underidentification contrasted

The example in the figure suggests that overidentification and 
underidentification are not simple logical opposites, except in a very 
trivial sense — in relation to the Counting Rules. Table 6. 1 gives the 
contrasts among over-, exact, and underidentification. 

We say that underidentification is not usually a stochastic property, 
because it arises from the a priori specification of the model and not 
from sampling, and so it cannot be removed by better sampling. 
Stochastic underidentification is in the nature of a freak; it was 
illustrated in Sec. 6.8. On the other hand, overidentification is a 
stochastic property that arises because we disregard some information 
contained in the sample. Overidentification is removed if all the 
information of the sample is utilized — which means that reduced-form 
least-squares estimation must be abandoned. 



Table 6.1 
Degree of identification 







Unique maximum of 
the likelihood func- 

Does not exist 



A -priori restrictions for 
locating single high- 
est point 

Not enough 


Too many 

Ambiguity, if any, 
introduced because: 

You have not 
enough inde- 
pendent varia- 
tion in supply 
and demand 

No ambiguity 

In reduced form 
you disregard one 
or another eauae 
in the variation 
of supply 

Estimate of tho 
parameters if 
based on re- 
duced form 


Biased, con- 
sistent j 

Biased, con- 

Biased, consistent 


Biased,* con- 

Biased,* con- 

Biased,* consistent 

Is the degree of identi- 
fication a stochastic 

Not usually; yes, 
if in fact a 
variable fails 
to vary 


In special cases, unbiased. 

6.14. Confluence 

Multicollinearity and underidentification are two special cases of a 
mathematical property called, confluence. 

Multicollinearity arises when you cannot separate the effects of two 
(or more) theoretically independent variables because in your sample 
they happen to have moved together. This topic is taken up again 
in Chap. 9. 

To show the connection between underidentification and multi- 
collinearity, I shall use a model adapted from Tintner, p. 33, which 
contains both. 



Suppose that the world price of cotton is established by supply and 
demand conditions in the United States. Let the supply of American 
cotton q depend only on its world price p, while the American demand 
for cotton depends both on its price and on national income y. 

DD q - ap + #?/ -f u 
SS q-yp + v 


How, demand in this model is underidentified. If the sample comes 

Fig. 12. Confluence. 

from earlier years, when cotton was king, then the model, in addition, 
suffers from multicollinearity, because the national income was strongly 
correlated with the price and quantity of cotton. In the parameter 
space for a, £, and 7, the likelihood function L has a stationary value over 
a region of the space afiy. To picture this (Fig. 12) let us forget 7 — or 
assume that someone has disclosed it to be -f-0.03. The true value of 
the parameters is the point (a,/?,0.03). The ambiguous area, over 
which L has a flat top, is the band PQRS in Fig. 12. If a sample is 
taken from more recent years, multicollinearity is reduced, because 
national income y and the world price of cotton p are no longer so 

6.14. CONFLUENCE 105 

strongly correlated as before. In Fig. 12, the gradual diversification of 
America's economy would appear as a gradual migration of the points 
in the band PQRS toward a narrower band around the curve MN. If 
the time comes when cotton becomes quite insignificant, then multi- 
collinearity will have disappeared, but not the underidentiflcation. 
In the figure, the band will have collapsed to the curve M 2V, but not 
to a single point. 


6.1 Suggest methods for removing multicollinearity and discuss 

Digression on the etymology of the term 

Suppose, for purposes of illustration only, that in (6-12) the 
true values of the parameters are simply a = — 1, = 1, 7 » 1. 
Also suppose that national income and the price of cotton are 
connected by the exact relation 

V = 3p 

(The exactness is for illustrative purposes only. What follows is 
also true for the stochastic case y = 3p -J- w.) Then the demand 
can be written 

DD q = 2p + u 

Now, the following estimates of a and are consistent with 
all observations: 

1. The true values a = — 1, = 1 

2. The pair of values a == 1, = }4 — because 

q = lp + %y 4- u = 2p -f u 

3. The pair of values a = 2, j3 = 0; and an infinity of other 
pairs, which can be represented as the collinear points of line AB 
in Fig. 13 

If we take a bogus demand function, then its parameters a©, 
/3® also form a collinear set of points — like line CD — that agrees 
with the sample. 

Removing multicollinearity causes EF to collapse into M , AB 



into N, and CD intoP; that is to say, removing multicollinearity 
collapses the band between the lines EF and CD into the line 
MNP. On the other hand, removing underidentification col- 
lapses the same band into the line AB. 

Nomu!tfcol!inearity^ v P 

Fig. 13. Multicollinearity. 

Further readings 

Koopmans, chap. 17, extends and refines the concept of a complete model. 

The excellent introduction to identification, also by Koopmans, was cited 
in Sec. 6.1. Klein gives scattered examples with discussion (consult his 

It is worthwhile to read the seminal article of Elmer J. Working, " What Do 
Statistical 'Demand Curves' Show?" (Quarterly Journal of Economics , vol. 41, 
no. 2, pp. 212-235, February, 1927), both for its contents and in order to 
appreciate how far econometrics has progressed since that time. 

Trygve Haavclmo's treatment of confluence, in "The Probability Approach 
in Econometrics" (Econometrica, vol. 12, Supplement, 1944), is hard going, 
but an excellent exercise in decoding. If you have come this far, you can 
tackle this piece. 

Those who appreciate the refinement and proliferation of concepts and 
are not afraid of flights into abstraction may glance at Leonid Hurwicz, 
"Generalization of the Concept of Identification," chap. 4 of Koopmans. 

Herbert Simon, "Causal Order and Identifiability," in Hood, chap. 3, 
shows how an econometric model can be analyzed into a hierarchy of sub- 
models increasingly more endogenous and how the hierarchy accords with 
the statistical notion of causation and that of identifiability. 


Instrumental variables 

7.1. Terminology and results 

The term instrumental variable in econometrics has two entirely 

unrelated meanings: 

1. A variable that can be manipulated at will by a policy maker as a 
tool or instrument of policy; for instance, taxes, the quantity of money, 
the rediscount rate j 

2. A variable, exogenous to the economy, significant, not entering 
the particular equation or equations we want to estimate, nevertheless 
used by us in a special way iiji estimating these equations 

In this work only the second 'meaning is used. 

This chapter explains and rationalizes the instrumental variable 
technique. It shows: j 

1. That the technique, though at first sight it appears odd, is 
logically similar to other estimating methods and also quite reasonable 

2. That, if the choice of instrumental variables is unique, the model 
is exactly identified, and that the instrumental variable method is 
equivalent to applying least; squares to the reduced form and then 
solving back to the original form 



To understand the logic of instrumental variables we must first take 
a deep look at parameter estimation in general. 

7.2. The rationale of estimating parametric relationships 

Two ideas dominate the strategy of inferring parametric connections 
statistically. The first idea is that variables can be divided into 
causes and effects. The second is that conflicting observations must 
be weighted somehow. 

Causes and effects 

There can be one or more causes, symbolized by c, and one or more 
effects, symbolized by c. Various "instances" or "degrees" of c and e 
will carry a subscript. A parameter is nothing more than the change 
in effect (s), given a change in cause (s). Symbolically this can be 
represented as follows: 

~ . change in effect (s) Ae ,- 1N 

Parameter = — — r ^ rr = t- (7-1) 

corresponding change in cause (s) Ac 

This relation 1 is fundamental in Chaps. 7 and 8; the general theme of 
these chapters is that all techniques of estimation are variations and 
elaborations of (7-1). 

The change in cause (s) and effect (s) can be any change whatsoever. 
For instance, the change in effect e may be e\ — 62, or €2 — Ci, or 
0i5 — 026? in general, e t — e*. The corresponding changes in cause c 
are d — c 2 , and c 2 — Ci, and C15 — 025; in general, c t — ce. 

Usually, however, the change is computed from some fixed reference 
level of the effect (s) or the cause (s) — and this fixed level is most 
typically the mean. 2 So parameters are usually computed by a 
formula like (7-2) rather than (7-1). 

Parameter estimate = ■ = ft (7-2) 

Ct — c N 

1 It is meant to be a conventional relationship. Only in the simplest linear 
systems is it true that the numerical value of a parameter can be expressed as 
simply as in equation (7-1). 

9 Ideally, the mean of the population. In practice, the mean of the sample. 
In linear models the distinction is immaterial for most purposes. 


This is merely a convenience, which does not affect the logic of par&me- 
ter estimation. Henceforth, all symbols Ac, Ac, e tf c h or dimply e 
and c represent deviations from the corresponding mean. 

The problem of conflicting observations 

What happens when two or more applications of (7-2) give different 
values for x? This is very likely to happen in stochastic models, 
because in such models the effect e results not merely from the explicit 
cause c but also from the disturbance u. Which of the several con- 
flicting values should be assigned to the unknown parameter *■? 
The problem arises in all but the simplest cases of parameter 
estimation. I 

In general, the parameter estimate is a weighted average of quotients 
of the form e/c. Take the model e t — yc t 4- u t . Then the weighted 
estimate of 7 is 1 

7 = — wi + — W2 + • • • H — u>i 

C\ ,C2 c, 

Any set of weights Wi, W2, . . J , w, will do, provided they add up to 1. 
If you want to attach much significance to the instance ew/cw, make 
wn large; if you want to disregard the instance 627/027, make Wn m 
or even negative. | 

Let us for simplicity restrict ourselves to just two observations. 
One of the many possible sets of weights is the following: 

ilL w m -A 

~r : C2 C\ T* Cj 

u>i - irxb ^2 - :txt« C?-3) 

Then, using these weights, the weighted estimate is also tha familiar 
least squares estimate, because 

Ci c\ , 62 j c\ e\C\ + C2C2 ___ tn w 

Ci cf + c| c 2 c[ + c§ " c| -f- c| "" ?/l<* 

So the least squares estimate amounts to nothing more than a 
special method for weighting; conflicting values of the ratios e/c. 
Now two questions arise: (1) Why should the weights (7-3) be functions 
of the c's or have anything at all to do with the cause c? and (2) why 


c 2 


c { 


log c 2 


Zc 2 

2|c 3 | 


s VR 

2 log c 2 


should we pick those particular formulas and not, for example, absolute 

Wl = N „, 2 = N ( 7^) 

|ci] 4- \Ci\ \ci\ + |cs| v ' 

or cubes, square roots, or logarithms? 

The answer to question 1 is that the more strongly the cause departs 
from its average level, the more you weight it. It is as though we said 
that the real test of the relationship e t = yc t + u t is whether it stands 
up under atypical, nonaverage conditions (i.e., when c t is far from c). 

But now why should one [as formulas (7-3) and (7-4) suggest] give 
equal weight to +c t and — c<? The common-sense rationale here is 
that the same credence should be given to a ratio e/c when c is atypically 
small (relative to its mean) as when it is atypically large. This 
requirement is met by an evenly increasing function of c, for instance: 


The answer to question (2) is this: From the many alternative 
formulas in (7-5), c 2 /2c 2 is selected because of the assumption that 
ti is normally distributed, in which case least squares approaches 
maximum likelihood. A different probability distribution of the 
disturbances would prescribe a different set of weights for averaging 
or reconciling conflicting values of e/c. 

7.3. A single instrumental variable 

With these preliminaries well digested, we are ready to understand 
the logic of instrumental variables. Suppose that 

Vt - fat + u t (7-6) 

is a model for the demand for sugar. This equation is known to have 
come from a larger system in which p (price) and q (quantity) are 
endogenous and a great many other causes are active. We call (7-0) 
the manifest model and all the remaining relationships the latent model. 
Suppose that z represents some exogenous cause affecting the economy, 
say, the tax on steel. Now in an interdependent economic system the 
tax on steel affects theoretically both the price of sugar and the quantity 


of sugar bought, because a tax; cannot leave unaffected either the price 
or the quantity, of any substitute or competitive goods, and sugar surely 
must be somewhere at the end of the chain of substitutes or com- 
plements of steel. ! 

The method of instrumental variables says that you ought to 
compute an estimate of y (from two observations 8 = 1, 2) as follows: 



y ** fir Wl + t* w * 

$1 Q2 

using as weights 


9i*i '■ 

w* - qiZ * 

W\ — 


1 4" 9222 

W2 . 

qiZi + qzZi 

so that the estimate of y 

is 1 

r _ Pi 9i*i 
qi q\Z\ + 52*2 

+ P2 

92 \q\Z 

q&2 __ V\Z\ + P2^2 t 

1 + 92*2 9i*i + 92*2 

3 2*5 

Every ounce of common sen&e in you ought to rear itself in rebellion 
at this perpetration. You ought to protest, saying: "Nonsen.^e! 
My boss at the Zachary Sugar Refinery will fire me unceremoniously 
from my well-paid and respected job of Company Economotrician if I 
tell the Vice-President that I multiply the price and quantity of sugar 
with the tax on steel to estimate demand for sugar! Better give me a 
good argument for this act of alchemy. Moreover, did it ever occur 
to you that m zq in the denominator could conceivably be equal to zero-— 
and certainly is in some samples? 1 This predicament could not arise in 
leastsquare weights like (7-3). " 

I hasten to reply to the last point first. The possibility that m q , 
might be zero is the reason why z should not be chosen haphazardly but 
rather from exogenous variables that have a lot to do with the quantity 
of sugar consumed. So, instead of the tax on steel, perhaps we ought 
to take the tax on coffee, honey, or sugar, or the quantity of school 
lunches financed by Congress. Still, it is possible that, in the sample 
chosen, the quantity of sugar q and the quantity of school lunches z 
happen to be uncorrected, and it is true that this sort of difficulty m 
unheard of in least squares, because m qq j ust cannot be zero : the quantity 
of sugar is perfectly correlated with itself. 

1 This sample would be a set of measure zero, as the mathematicians say* 



Now what do the weights (7-9) say? They say that, the more sugar 
consumption and school lunches move together, 1 the more weight 
should be given to A-p/Aq, the price effect of the change in the quantity 
consumed. There is another way to look at this matter: Write, 
purely heuristically, 


8 - ^ = 

Aq Aq/Az 


where Ap/Az is symbolized by 71 and Aq/Az by 72. How could we 
possibly interpret 71 and 72? 
Figure 14 represents both the latent and the manifest parts of the 







*2 X 


Fig. 14. The logic of the instrumental variable. 

model (7-0). The solid arrow represents the manifest mutual causa- 
tion p «-» q f which appears in (7-6). The broken arrows represent the 
latent model, which is not spelled out but which states that z affects 
p and q through the workings of the whole economic system. So the 
meaning of 71 and 72 is that they are the coefficients of the latent model. 
Since z is exogenous, why not estiniate 71 and 72 by least squares? It 
sounds highly reasonable. There should be no objection to this. 

?i == 

m t 
m t 

Insert (7-11) in (7-10), and then 

* _ Ap/A z 
P Aq/Az 



72 = — - 

m Z p/m z 

m tp 
m zq 

m zq /m t 

which justifies the instrumental variable formula (7-9). 
1 In a given instance, not over-all. 




It is well to ask now: in what way do the two examples p ■ fiq 4- u . 
and e = yc -f* w differ? Why may the second model be computed by 
least squares although the first has to be estimated by instrumental 
variables? The reason is that the second c — > e is a complete model in 
itself: causation flows from cjto e, and nothing else is involved. Oa 
the other hand, p = Pq + w hides a lot of complexity, i.e., causation 
from z to p and from 2 to q. These causations are unidirectional 
z -*p and z --* 5 and can be treated like c — > e; but p «-» g is not of the 
same kind, and must be treated by taking into account the hidden 

7.4. Connection with the reduced form 

The instrumental variable technique is very intimately connected 
with the method of applying least squares to the reduced form. 
Assume for the moment that $ is the only exogenous variable affecting 

the economy and that the complete model is 


P — Pq = u 

1 L 1 (7-13) 

— P t — *q — z — v 

whose first equation is manifest (the solid arrows in Fig. 14). The 
second equation is latent and corresponds to the broken arrows. 

: Exercise 

7. A Explain why the complete model cannot contain three inde- 
pendent equations, one for p «;-» q, one for z --* p, and one for z ••* q. 

The reduced form is 


Dp == 0z H u + @v 

\* (7-14) 

Dq =■ z tt+ v 

I ?* 


where D is the determinant I/72 + 18/72. 

If the second equation in (7-13) contained another exogenous 
variable, say, 2', then the first iequation would be overidentified. This 
fact would be reflected in the' instrumental variable technique as the 


following dilemma: Should we use z or z' as the instrumental variable? 
On this last point I have more to say in Sees. 7.5 to 7.7. 

7.5. Properties of the instrumental variable technique 
in the simplest case 

The technique is biased and consistent, where naive least squares 
is biased and inconsistent. 

Bm*!s-fi + 2z> m fi + 1 _J22 — _ — (7-15) 

rn tq m, q r Z) ra„ - (l/yi)m gu + m zv v ' 

^ g , _1 ^iu — (1/Tl)^uu + Wvu 

D m it - (2/yi)m tu + 2m tv + (l/y\)m uu - (2/7i)m rtt + m vv 


Under the ordinary Simplifying Assumptions, <r tu = <r„ = ow = 0; 
so, for large samples, the first expression approaches 0, and the second 

Expression (7-15) gives us additional guidance for selecting an 
instrumental variable. To minimize bias, the following conditions 
should be fulfilled, either singly or in combination: 

1. -m att numerically small 

2. m„ numerically large 

3. D numerically large 

The first condition says that z should be truly exogenous to the sugar 
market. Appropriations for school lunches are better in this respect 
than the tax on sugar, because the tax might have been imposed to 
discourage consumption or to maximize revenue, in which case it 
would have some connection with the parameters and variables of the 
sugar market. 

The second condition says that, in the sample, the instrumental 
variable must have varied a lot: if the tax on sugar varied only trivially 
it had no opportunity to affect p and q significantly enough for us to 

7.6. EXTENSIONS i 115 

capture /3 by our estimating procedure. From this point of view the 
tax on sugar might be less desirable as an instrumental variable than 
some remote but more volatile? entity, say, the budget's appropriations 
for U.S. Information Service, j 

The third condition says that D = l/? 2 + 0/7 1 should be numeri- 
cally large; that is, that 71 and 72 should be numerically small relative 
to (3. This says that, to minimize bias, p and q should react more 
strongly to each other (in the manifest model) than to the instrumental 
variable in the latent model, j It requires that price and quantity be 
more sensitive to each other in the sugar market than to such things 
as the U.S.I.S. budget, the tax on honey, or, for that matter, the tax 
on sugar itself. 

It is not easy to find an instrumental variable fulfilling all 
conditions at once. However, if the sample is large, any sort of. 
instrumental variable gives better estimates than least squares. 

7.6. Extensions 



The instrumental variable; technique can be extended in several 
directions: I 

1. The single-equation incomplete manifest model may con- 
tain several parameters to be estimated. For example, the model 
V = 0i<Z + $iV + u requires two instrumental variables z\ and z*. 

The estimating formulas are analogous to (7-9) : 

S = m <'i''«>|iMrt g __ m (*..« t )(g.p) (7-17) 

All the criteria of Sec. 7.5 fo:: selecting a good instrumental variable 
are still valid, plus the following* z\ and z 2 must really be different 
variables, that is, variables not well correlated in the sample; else the 
denominators approach zero, and the estimates &, £ 2 blow up. 


7.B If we wish to estimate the parameters of p — fiq + yz + u, 
where z is exogenous, is it permissible for z itself to be one of the 
instrumental variables z\ and 2 2 ? 

2. The incomplete manifest model may consist of several equations, 


for instance: 

021? 4" ? + 72« = W 2 

Each equation can be estimated independently of the other, using 
formulas analogous to (7-17). Variable z itself and another variable z x 
may be used as instrumental variables in both equations, or two 
variables %\ % z 2 completely extraneous to the manifest model may be 

7.7. How to select instrumental variables 

In some instances we may have several candidates for the role of 
instrumental variable. The choice is made anew for each equation 
of the manifest model, and the rules are: 

1. If several instrumental variables are needed, they should be 
those least correlated with one another. 

2. The instrumental variables should affect strongly as many as 
possible of the variables present in the equation that is being estimated. 

Choosing instrumental variables is admittedly arbitrary. Another 
statistician with the same data might make a different choice and so 
get different results for the same model. The technique of weighting 
instrumental variables eliminates some of this arbitrariness. I illustrate 
the technique for the single-equation, single-parameter demand model 
p = @q + u. Suppose that two exogenous variables are available, 
z i (the sugar tax) and z<i (the tax on honey), and that both affect p and 
q. To select z\ or z 2 is arbitrary. The new variable z = Wiz x + w 2 Z2, 
a linear combination of the two taxes with arbitrary weights W\ and w 2 , 
is less arbitrary because both taxes are taken into account. 1 

Results improve considerably if we take W\ } w 2 proportional to the 
importance of the two taxes on the sugar market. Naturally, to 
estimate the parameters of the sugar market, the weight given the 
sugar tax should be greater than that given to the tax on honey; and 
vice versa when we want to study the honey market. In general, we 
ought to rank the instrumental variable candidates z h z 2 , Zz, * . . in 
order of increasing remoteness from the sector being estimated and 

1 This treatment with w x « u> 2 » 1 coincides with Theil's method with k m 1. 
Consult Chap. 9 below. 


assign them decreasing weights in a new instrumental variable 
z = w&i + w&2 + w&z +••'.•• The more accurate the a priori 
information by means of whioh weights are assigned, the more does 
this technique approximate 5he results of the full information maxi- 
mum likelihood method, discussed in Chap. 8. 


Warning: These are difficult! 

7.C Prove or disprove the conjecture that weighted instrumental 
variables are "better" than unweighted. Use the model p = Pq -f u, 
(l/yi)p + (I/72)? — to — $i;Z 2 — v, where y h y it 6 h 5 2 measure the 
sensitivity of p and q to Z\ and z 2 . Define W\ and tt? 2 as 6i/(5i -f 5 2 ). 

m *tQ 


Prove e{/Sf(*i) - Hfri)]) 1 > «WW - ttfMll 1 < «WW - «(£<*«>] I f - 

7.D Prove or disprove the conjecture that the goodness of the 
weighted instrumental variably technique is insensitive to small depar- 
tures of W\ t w t ^om their ideal relative sizes 5i/(5i + 62), 62/(61 + 5 2 ). 


Limited information 

8tl« Introduction 

Limited information maximum likelihood is one of the many tech- 
niques available for estimating an identified (exactly or overidentified) 
equation. Other methods are (1) na'ive least squares, (2) least squares 
applied to the reduced form, (3) instrumental variables, (4) weighted 
instrumental variables, (5) TheiFs method, 1 and (6) full information. 

Method 1 is biased and inconsistent; the rest are biased and con- 
sistent. They are listed in order of increasing efficiency. Limited 
information leads to burdensome computations, but is less cumbersome 
than full information. Unlike full information but like all other 
methods, limited information can bo used on one equation of a model 
at a time. Limited information differs from the method of instru- 
mental variables in two ways: it makes use of all, not an arbitrary 
selection, of the exogenous variables affecting the system; it prescribes 
a special way of using the exogenous variables. If an equation is 
exactly identified, limited information and instrumental variables are 
equivalent methods. Like all methods of estimating parameters, 
limited information uses formulas that are nothing more than a 

1 Discussed in Chap. 9. 



glorified version of the quotient 

Change in effect (s) 

Corresponding change in cause(s) 

I shall illustrate by the example of (8-1), where the first equation ii l@ 
be estimated by the limited information method. The rest of !h§ 
model may be either latent or manifest. The limited information 
method ignores part of what goes on in the remaining equations by 
deliberate choice, not because they are latent (though, of course, they 
might be). However, in (8-1) the entire model is spelled out for 
pedagogic reasons. The minus signs are contrary to the notations! 
conventions used so far but ssre very handy in solving for ff\, i/s» y%» 
Nothing in the logic of the situation is changed by expressing B and T 
in negative terms. 

2/i - ft/2 - 7*1 - Wi 

2/2 - 0232/3 - 72^2 - 723^3 - 78*24 * «2 (84) 

~ 0312/1 -f 1/3 - 731*1 - 734^4 « U% 

As usual, a single asterisk distinguishes the variables admitted in 
the first equation and a double asterisk those excluded from it. Thus 
y* = vec (2/1,2/2), z* = vec (zj, y** ■■ vec (1/3), z** - vec (zi,fi|f«).. 

We apply the limited information method in two cases: 

1. When nothing more is known about the economy than that some- 
how z 2 , Z3, Zi affect it 

2. When more is known bub this is purposely ignored 

8.2. The chain of causation 

Let the first equation be the> one we wish to estimate, out of a model 
containing several. The chains of causation in a general model of 
several equations in several unknowns are shown in Fig. 15. The arrow- 
heads show that causation flows from the z's to the y'a but not baok> 
and mutually between the y's. Solid arrows correspond to the ?&$% 
equation, broken arrows to the rest of the model. The two left-hand 
rounded arrows, one solid, one broken, show that the y*'a (the endoge- 
nous variables admitted in the manifest model) interact both in the 
first equation and (possibly, tDo) in the rest of the model. The right- 



Fig. 15. Causation in econometric models. 

Fig. 16. Chain of causation in the special model (8-1). 

hand rounded arrow shows that the y**'s (the endogenous variables 
excluded) interact, but, naturally, only in the rest of the model. 

Parenthetically, the crinkly arrows symbolize intercorrelation among 
the exogenous variables. Ideally, the exogenous variables are uncon- 
nected, but in any given sample they may happen to be intercorrelated. 
This is the familiar problem of multicollinearity (Sec. 6.14) in its 


general form. The stronger the; correlation between one z and another, 
the less reliable are estimates of the 7s because different exogenous 
variables have acted alike in ithe sample period. We shall ignore 
multicollinearity and continue with the main subject. 

Figure 16 shows the chains of causation in (8-1). The variable %% 
affects ?/i and t/ 2 in the first equation and y\ and yz in the third. 
Variables z 2 , 23, zk affect t/ 2 and v/ 3 in the second, and z 4 affects 7/1 and ijt 
in the third. We can make the arrows between the y's single rather 
than double-headed because there are as many equations as there are 
endogenous variables. Thus tlie model can be put into cyclical form. 1 
It so happens that (8-1) is already in this form; that is, given a 
constellation of values for the exogenous variables and the random 
disturbances, if we give yz an arbitrary value, then y 3 determines y%^ 
which in turn determines y h which in turn affects y z , and so round and 
round until mutually compatible values are reached. 

8.3. The rationale of limited information 

The problem of estimating thte model of Fig. 16 can be likened to the 
following problem. Suppose that z h z 2 , zz, zk are the locations of four 
springs of water and that ?/i, y%, 1/3 are, respectively, the kitchen tap, 
bathtub tap, and shower tap of k given house. The arrows are pipes or 
presumed pipes. Estimating the first equation is like trying to find 
the width of the pipes between z\ and yi and y% and of the pipe between 
2/ 2 and y\. The width is estimated by varying the flows at the four 
springs z\, z 2 , 23, zk and then measuring the resulting flow in the kitchen 
(2/1), bathtub (y 2 ), and shower (2/3). Limited information attempt! to 
solve the same problem with the following handicaps arising either from 
lack of knowledge or from deliberate neglect of knowledge: 

1. Pipes are known to exist for certain only where there &r© £olid 
arrows (7, j3). 

2. It is known that z 2 , Zz, z* enter the flow somewhere or other ? but 
it is not known where. 

3. It is not known whether there is another direct pipeline (*y«) 
from z\ to the kitchen (2/1) and bathtub (?/ 2 ). 

4. The flow at the shower (2/3) is ignored even if it is measurable. 
1 Note carefully that the cyclical and the recursive are different forms. 


So as not to fill up page upon page with dull arithmetic, I am going 
to cut model (8-1) drastically by some special assumptions, which, I 
vouch, remove nothing essential from the problem. The special 
assumptions are: £ = l.G, 7 = 0.5, 731 — 734 = 0, 1831 = 0.1, /3 2 3 = 2; 
72222 -f 723Z3 + 72424 is combined into one simple term = 722**; 
72 = 0.5. Then (8-1) collapses to 

t/i -- Qy 3 - yzX = Mi 

2/2 - £232/3 - 722** = u 2 (8-2) 

- 0312/1 +2/3 = M 3 

and Fig. 16 collapses to Fig. 17. Now let us change metaphors. 

Fig. 17. Another special case. 

Instead of a hydraulic system, think of a telephone network. The 
coefficients 0, 7, if greater than 1, represent loud-speakers; if less than 
1 , low-speakers. Where a coefficient is equal to 1, sound is transmitted 
exactly. To avoid having to reconcile conflicting observations, assume 
that all the disturbances are zero, i.e., that there is neither leakage of 
sound out of nor noise into the acoustic system of Fig. 17. 

Here is how the estimating procedure works. Begin from a state 
of acoustical equilibrium, and measure the noise level at each point of 
the network. Then step up the sound level at z** by 100 units. 
Only 50 of these reach location 2/2, because there is a twofold low-speaker 
(72 = 0.5) between z** and 2/2. Also step up the sound level at z* by, 
say, 10 units. Only 5 units (7 = 0.5) get to 2/1. But, whatever extra 


noise there is at y1, one-tenth of it (β31 = 0.1) reaches y3. From y3 a
loud-speaker doubles the increment as it conveys it to y2, whence some
gets to y1, and so on. By differencing (8-2) and solving for Δu = 0,
Δz* = 10, and Δz** = 100, the ultimate increments are found to be
Δy1 = 125, Δy2 = 75, Δy3 = 12.5. Now, suppose we did not know how
strong was the low-speaker connection β between y1 and y2. By
differencing (8-2), we get

est β = (Δy1 − γ Δz*)/Δy2        (8-3)

When the model is exact, it takes exactly five observations to determine
β, γ, γ2, β23, β31. When the model is stochastic, there are complica-
tions, but the basic appearance of the formula is not much different.
The numerator can be interpreted as that change in the sound level y1
not attributable to what is coming over the line from z*, that is to say,
only the sound that comes from y2 and y3. The denominator measures
the increment at y2 resulting from two sources, z** and y3. The limited
information method just ignores the latter source entirely. This is so
because both β23 and β31 belong to the "rest of the model" and are
neither specified nor evaluated. So, (8-3) is interpreted as follows:

est β = (variation in y1 not due to z*)/(variation in y2 from all sources)        (8-4)

The limited information method suppresses β23 and β31 and estimates

est β = (Δy1 − γ Δz*)/(γ2 Δz**) = (variation in y1 not due to z*)/(variation in y2 due to z**)        (8-5)

Notice carefully that the method suppresses only β23, β31, that is, the
latent model's intervariation of the endogenous variables. It does not
suppress γ2, i.e., the variation (in the latent model) due to the exogenous
variable z**.
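The arithmetic of the acoustic experiment can be checked directly. The following is a minimal sketch in Python, assuming the special coefficients given above (β = 1.6, γ = 0.5, β23 = 2, β31 = 0.1, γ2 = 0.5) and zero disturbances; it differences the system and then recovers β both by the exact formula (8-3) and by the limited-information ratio (8-5).

```python
# Numerical check of the telephone-network example, assuming the special
# coefficients of (8-2) and zero disturbances (an illustrative sketch).
beta, gamma = 1.6, 0.5        # first equation:  y1 = beta*y2 + gamma*z*
beta23, gamma2 = 2.0, 0.5     # second equation: y2 = beta23*y3 + gamma2*z**
beta31 = 0.1                  # third equation:  y3 = beta31*y1

dz_star, dz_2star = 10.0, 100.0   # step up z* by 10 units, z** by 100 units

# Differencing (8-2) with all du = 0 and substituting the third and
# second equations into the first:
dy1 = (gamma * dz_star + beta * gamma2 * dz_2star) / (1 - beta * beta23 * beta31)
dy3 = beta31 * dy1
dy2 = beta23 * dy3 + gamma2 * dz_2star
print(dy1, dy2, dy3)          # approximately 125, 75, 12.5

# (8-3): the exact estimate of beta from the differenced observations
beta_exact = (dy1 - gamma * dz_star) / dy2                 # approximately 1.6

# (8-5): the ratio the limited information method looks at; with these
# exact data it differs, because it ignores the sound reaching y2 via y3
beta_li = (dy1 - gamma * dz_star) / (gamma2 * dz_2star)    # approximately 2.4
```

With disturbance-free data the two ratios differ precisely because (8-5) suppresses β23 and β31; in the stochastic case the limited information method applies this kind of ratio to suitably weighted moments, as Sec. 8.4 shows.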

8.4. Formulas for limited information 

This section shows that the lengthy formulas for computing limited 
information estimates are just fancy versions of (8-3). It can safely 
be skipped, for it contains no new ideas. To obtain estimates of the 



βs of the first equation, combine the moments of the variables as in
the following list (the general form first, then its instance in model (8-1)):

1. Construct C = m_y*z (m_zz)⁻¹ m_zy* ; in (8-1):
m_(y1,y2)(z1,z2,z3,z4) · [m_(z1,z2,z3,z4)(z1,z2,z3,z4)]⁻¹ · m_(z1,z2,z3,z4)(y1,y2)

2. Construct D = m_y*z* (m_z*z*)⁻¹ m_z*y* ; in (8-1):
m_(y1,y2)(z1) · (m_z1z1)⁻¹ · (m_z1y1, m_z1y2)

3. Construct W = m_y*y* − C

4. Compute ¹ V = m_y*y* − D

5. Compute ² Q = V⁻¹W

6. The estimate of the βs of the first equation is a nontrivial solution
(called the eigenvector) of Q.

7. Having computed β̂, one can calculate γ̂, û, and estimates of the
covariances of the disturbances and the parameter estimates.

In steps 1 and 2 above, the factors m_zz and m_z*z* and also the z and
z* in the remaining moment matrices play a role analogous to the
weights c/Σc² in the least squares technique.³ They just provide a
method for reconciling the conflicting observations generated by the
nonzero random disturbances.

The matrix m_y*y* corresponds to the pair of round arrows about y*
in Fig. 15.

Essentially, the eigenvector of Q is the estimate of the βs of the first
equation, and Q can be interpreted as a quotient, because the matrix
operation V⁻¹W reminds one of the ratio of two numbers: W/V.
Actually, this impressionistic intuition is quite correct. W corresponds
to an elaborate case of Δe (the change in the effects), and V is an
elaborate case of Δc (the corresponding change in the causes). Indeed
W and V are complicated cases of the numerator and denominator of
(8-5). W is interpreted as the variation of the endogenous variables
not due to any exogenous changes, and V expresses the variation of the
endogenous variables from all sources exogenous to any equation of the
model and endogenous to the manifest part.

1 Klein calls this B instead of V. I use V to avoid confusion with the B of
By + Γz = u.

2 Klein calls this A. I use Q to avoid confusion with the A of the model
Ax = u.

3 Compare with Sec. 7.2.
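The steps above can be tried out numerically. The sketch below uses a hypothetical two-equation system (it is not model (8-1)): two endogenous variables y1, y2 and two exogenous variables z1, z2, with the first equation containing y1, y2, z1 and excluding z2. All coefficients and data are invented for the demonstration.

```python
# A minimal numerical sketch of steps 1-6 on an invented two-equation
# system; m(a, b) denotes the sample moment matrix A'B/n.
import numpy as np

rng = np.random.default_rng(0)
n = 5000
z = rng.normal(size=(n, 2))            # columns z1 (included), z2 (excluded)
e = rng.normal(size=(n, 2))
u2 = 0.3 * e[:, 1]
u1 = 0.3 * e[:, 0] + 0.5 * u2          # correlated disturbances
y2 = 0.5 * z[:, 1] + u2
y1 = 0.8 * y2 + 1.0 * z[:, 0] + u1     # first equation: y1 - 0.8*y2 - z1 = u1

Y = np.column_stack([y1, y2])          # the included endogenous variables y*
Z_all, Z_inc = z, z[:, :1]             # all z, and the included z* = z1

def m(a, b):
    return a.T @ b / len(a)

C = m(Y, Z_all) @ np.linalg.inv(m(Z_all, Z_all)) @ m(Z_all, Y)   # step 1
D = m(Y, Z_inc) @ np.linalg.inv(m(Z_inc, Z_inc)) @ m(Z_inc, Y)   # step 2
W = m(Y, Y) - C                                                  # step 3
V = m(Y, Y) - D                                                  # step 4
Q = np.linalg.inv(V) @ W                                         # step 5

# Step 6: the eigenvector of Q belonging to its extreme root
vals, vecs = np.linalg.eig(Q)
b = np.real(vecs[:, np.argmax(np.real(vals))])
b = b / b[0]                           # normalize the coefficient of y1 to 1
print(b)                               # close to (1, -0.8)
```

The eigenvector recovers the βs of the first equation up to scale; step 7 would then back out the γs and the covariance estimates.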

8.5. Connection with the instrumental variable method 

Limited information recognizes that exogenous influences not present 
in the first equation influence the course of events. The instrumental 
variable method acknowledges the same thing. Limited information 
makes use of all these exogenous influences, whereas the instrumental 
variable method (generally) picks from among them, either hap- 
hazardly or according to the principles of Sec. 7.7. 

When the first equation is exactly identified, picking is impossible 
and the two methods coincide. 

8.6. Connection with indirect least squares 

The limited information method can also be interpreted as a form of 
modified indirect least squares or as a generalization of directional 
least squares (see the Digression in Sec. 4.4). The direct or naive 
least squares method estimates β essentially as the regression coefficient
of y1 on y2. Haavelmo's proposition (Chap. 4) advised us to minimize
square residuals in the northeast-southwest direction in order to allow
for autonomous variations in the exogenous variable, investment z_t.
In (8-1) there are several such exogenous variables z1, z2, z3, z4, which
generate in the y1y2 plane a scatter diagram which is a weighted
average of lozenge-shaped figures (as in Fig. 9), one for z1, one for z2,
and so on. In matrix C (and, hence, in W and V) this weighted
averaging has taken place.

Further readings 

Hood, chap. 10, describes in detail how to compute limited information 
and other types of estimates, and illustrates with a completely worked out 
macroeconomic model of Klein's. 


The family of simultaneous 
estimating techniques 

9.1. Introduction 

We owe to Theil 1 a theorem showing that all the estimating tech- 
niques of Chaps. 4 to 8 are special cases of a new technique, which has 
the further merit of being fairly easy to compute. Section 9.2, which 
covers this ground, is addressed primarily to lovers of mathematical 
generality and elegance; other readers might skip or skim. 

The other sections of this chapter reconsider underidentification and 
overidentification from the point of view of research strategy. Section 
9.3 accepts models as given (over-, under-, or exactly identified) and 
suggests alternative treatments. Section 9.4 raises the issue of 
whether econometric models can be anything but underidentified. 

9.2. Theil's method of dual reduced forms

This method can be applied to all equations of a system, one at a 
time. The equation we want to estimate, called the "first" equation, 

1 Reference in Further Readings at the end of this chapter. 



comes from a complete system, for instance, (8-1). We know and can 
observe all the exogenous variables affecting the system, and we also 
know a priori which variables (endogenous and exogenous) enter the 
first equation. The other equations may be identified or not. The 
disturbances have the usual Simplifying Properties. Any endogenous 
variable of the first equation can be chosen to play the role of dependent 
variable. We shall use y1 in this role. The remaining variables of the
first equation, namely, y2, …, yG*; z1, …, zH*, must all be differ-
ent in the sample; that is to say, they must not behave as if they were
linear combinations of one another. We do not need to know or
observe the endogenous variables yG*+1, …, yG not present in the
first equation.

Let one star, as usual, represent presence in the first equation, and 
two stars, absence from the first equation. 

We then form two reduced forms whose coefficients we calculate by
simple least squares: (1) y* on z* with parameters π̂ and residuals v;
and (2) y* on z = (z*,z**) with parameters p̂ and residuals w. For
instance, to estimate the first equation of (8-1), compute

y1 = π̂11 z1 + v1        y1 = p̂11 z1 + p̂12 z2 + p̂13 z3 + p̂14 z4 + w1        (9-1)
y2 = π̂21 z1 + v2        y2 = p̂21 z1 + p̂22 z2 + p̂23 z3 + p̂24 z4 + w2

The right-hand set in (9-1) is necessary for estimating the first and
useful for estimating the other equations of (8-1). Let us omit the
bird (ˆ) where it is obvious.

Next, we compute the moments of the residuals on one another and
construct two new matrices D(k) and N(k):

D(k) = m_(y2,…,yG*; z1,…,zH*)(y2,…,yG*; z1,…,zH*) − k·m_(w2,…,wG*; 0,…,0)(w2,…,wG*; 0,…,0)

N(k) = m_(y2,…,yG*; z1,…,zH*)(y1) − k·m_(w2,…,wG*; 0,…,0)(w1)

where k is a variable that will be defined below. Then the estimates
of the βs and γs of the first equation are given by

est (β2, …, βG*; γ1, …, γH*) = [D(k)]⁻¹ N(k)        (9-2)

Theil has proved that, if k = 0, then (9-2) gives the naive least
squares estimate with y1 treated as the sole dependent variable. If
k = 1, (9-2) gives the method of unweighted instrumental variables
of Sec. 7.7. If k = 1 + ν, where ν is the smallest root of

det [m_(v1,…,vG*)(v1,…,vG*) − (1 + ν)·m_(w1,…,wG*)(w1,…,wG*)] = 0        (9-3)

then the estimates of (9-2) are identical with the limited information
estimates of Chap. 8. All these estimates except for the k = 0 case
are consistent, but biased in the βs. In the case k = 1, the bias itself
can be estimated and corrected for.¹

These findings not only are exciting for their beauty and symmetry, 
but are practical as well. The regressions (9-1) are straightforward 
and attainable by simple calculation (see Appendix B) even for large 
systems. The solution of (9-3) is not too hard, since the number G* 
of present endogenous variables seldom exceeds 3 or 4 in any actual 
models. But (9-3) must be calculated over again if we decide to 
estimate the second or third equation of the original model. Theil 
states that his technique works if the remaining equations of the 
system are nonlinear and that it works for large samples even when 
some of the z's are lagged values of some y. 
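The recipe can be tried out numerically. The system below is hypothetical and exactly identified (z* = z1, z** = z2, first equation y1 = βy2 + γz1 + u1 with β = 0.8, γ = 1.0); every coefficient and data point is invented. The sketch builds the residuals w of the full reduced form in (9-1) and then evaluates (9-2) at k = 0 (naive least squares) and k = 1 (unweighted instrumental variables).

```python
# A sketch of Theil's D(k), N(k) construction on invented data.
import numpy as np

rng = np.random.default_rng(1)
n = 4000
z = rng.normal(size=(n, 2))                 # z1 included, z2 excluded
e = rng.normal(size=(n, 2))
u2 = 0.3 * e[:, 1]
u1 = 0.3 * e[:, 0] + 0.5 * u2               # correlated disturbances
y2 = 0.5 * z[:, 1] + u2
y1 = 0.8 * y2 + 1.0 * z[:, 0] + u1

# Second reduced form of (9-1): regress y* on all z; keep the residuals w
Y = np.column_stack([y1, y2])
p_hat, *_ = np.linalg.lstsq(z, Y, rcond=None)
w = Y - z @ p_hat                           # columns w1, w2

X = np.column_stack([y2, z[:, 0]])            # explanatory variables of eq. 1
Xw = np.column_stack([w[:, 1], np.zeros(n)])  # residual counterpart (w2, 0)

def k_class(k):
    """(9-2): est(beta, gamma) = [D(k)]^(-1) N(k)."""
    D = (X.T @ X - k * (Xw.T @ Xw)) / n
    N = (X.T @ y1 - k * (Xw.T @ w[:, 0])) / n
    return np.linalg.solve(D, N)

print(k_class(0))   # k = 0: naive least squares; beta biased upward here
print(k_class(1))   # k = 1: consistent; near (0.8, 1.0)
```

At k = 0 the matrices D(0), N(0) are just the ordinary normal equations; since this toy system is exactly identified, k = 1 already coincides with the limited information estimate, and the root of (9-3) would matter only under overidentification.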

9.3. Treatment of models that are not exactly identified 

This section gives advice on how to treat models that in their 
natural state contain some underidentified or some overidentified 
equations, or both. The alternatives are listed from the most desirable 
to the least desirable, disregarding the cost of computation. 

If a model contains some underidentified equations, we need do 
nothing about them unless we wish to estimate them. The remaining 
equations, if identified, can be estimated in any case. 

If we wish to estimate the underidentified equation, we must make 
certain alterations: 

1. Make it identified by bringing in parameter estimates from 
independent sources, say, cross-section data. There are pitfalls of a 
new kind in this method, however, which are noted briefly in Chap. 12. 

2. Identify the equation in question by strategically adding variables 
elsewhere in the model. This process, however, might de-identify the 
rest of the model. 

3. Go ahead and estimate the underidentified equation; then, if you
have a priori information on covariances, perform the tests of Sec. 6.9

1 Compare with Appendix B. 


to detect (or try to detect) whether you have estimated a bogus equation.




If, on the other hand, the model contains some overidentified equations:
1. Use the full information, maximum likelihood method. This will 
yield consistent and efficient estimates of the identifiable parameters. 

2. Use the limited information, maximum likelihood method.

3. Use instrumental variables, weighted.

4. Use instrumental variables, unweighted.

5. In the given equations, add variables where they are most relevant 
in such a way as to remove the overidentification. 

6. Enlarge the system by endogenizing a previously exogenous
variable.
7. In the original overidentified model, remove the overidentification 
by introducing redundant variables in the other equations. If it 
turns out that the redundant variable has a significant parameter, you 
have succeeded. 

8. Drop variables to remove the overidentification. Instead of 
outright dropping, you may linearly combine two or more such 
variables. This cannot always be done, because the combined 
variables are not always present together or absent together elsewhere 
in the model. 

9. Use the reduced form, and select arbitrarily one of the several 
sets of alternative estimates. 

Underidentification is a more serious handicap than overidentifica-
tion. To remove the former you have to make material alterations in
the model. To remove the latter you can always use the full informa-
tion method.

Whatever the final alterations, I would begin by constructing my
models without worrying about identification. In doing so, I am sure
that I am acting in the light of my best a priori wisdom, given the
objectives of my study and my computing budget. If it turns out
that identification makes alterations necessary, I think that honesty
requires me to keep a record of the identifying alterations. Like
Ariadne's thread, this record keeps track of my search for a second
best; I may want to give up in frustration and return to try another
way out of the Minotaur's chamber.


9.4. The "natural state" of an econometric model 

Econometricians have devoted a good deal of attention to over- 
identified models. This entire book, from Chap. 6 on, is devoted to 
developing various approximations 1 to the full information method, 
which everybody tries to avoid because of its burdensome arithmetic. 
According to Liu, 2 we have been wasting our effort, because all well- 
conceived econometric models are in truth necessarily underidentified: 

In economic reality, there are so many variables which have an important 
influence on the dependent variable in any structural equation that all 
structural relationships are likely to be "underidentified." 

So Liu would not use any of our elaborate techniques, but would 
estimate just the reduced form and do so by simple least squares. 
The reduced form is to include as many exogenous variables as our 
knowledge and computational patience permit. Liu would then use 
these estimates for forecasting, and claims that they forecast better 
than all other techniques. 

These subversive ideas deserve careful consideration. Is it true 
that structural equations in their natural, unemasculated, noble- 
savage state are underidentified? If they are, in what sense are 
forecasts from the reduced form better? 

To begin with, there are occasions in which the investigator does
not care to know the values of the structural parameters and is content
with some kind of reduced form. To illustrate one occasion of this
sort, assume that the investigator

1. Works from a typical and large enough sample

2. Forecasts for an economy of fixed structure

3. Forecasts from exogenous variables that stay in their sample range

Under the above conditions, an investigator would be glad to work
with a ready-made reduced form though not necessarily with parame-
ters estimated by simple least squares. He would accept the latter
if justifiable, not for want of anything better.

1 Unweighted and weighted instrumental variables and limited information. 
2 Ta-Chung Liu, "A Simple Forecasting Model for the U.S. Economy," p. 437
(International Monetary Fund Staff Papers, pp. 434-466, August, 1955).


Are econometric models necessarily underidentified? Admittedly,
it is an oversimplification, as Liu states,¹ to impose the condition that
certain variables be absent from a given structural equation. But it is
gross "overcomplification" — to coin a much-needed word — to impose
no condition at all, inviting into the demand for left-handed, square-
headed ½-inch bolts (and on equal a priori standing with the price of
steel) the average diameter of tallow candles and the failure or success
of the cod catch off the banks of Newfoundland. My instinct advises
me to go halfway concerning these new variables: neither leave them
out altogether nor admit them as equals. Consider the model

q + αp + γr + δf = u
βq + p = v        (9-4)

consisting of one underidentified and one overidentified equation.
Now, if r and f are admitted as equals in the second equation, with
parameters of their own, the whole system becomes underidentified.
But the very knowledge that first convinced us to leave them out of the
second equation now advises us to tack them on with a priori small
parameters, small relative to γ, δ, etc. A reasonable restatement
might be the following:

q + αp + γr + δf = u
βq + p + jγr + kδf = v        (9-5)

where j, k are small constants, say 1/1000, 1/100, or some other not
unreasonable value. And now (wonder of wonders!) both equations
have become identified. The trick does not always work. For
instance, it does not help in

q + αp + γr = w1
βq + p = w2        (9-6)

to fill the hole with kαr, nor kβr, nor kγr, because we still have three
parameters (α, β, γ) to estimate and the reduced form contains only
two coefficients π1 = m_qr/m_rr, π2 = m_pr/m_rr. However, if the supply
of exogenous variables is less niggardly than in (9-6), it is not hard to
find reasonable ways to complete a model so as to identify it in its
entirety, if we so desire.

The most difficult and dangerous step is the assigning of values to
j and k. The values must have the correct algebraic sign; otherwise,
structural parameters are wildly misestimated. If the correct magni-
tudes for j and k are unknown, it is better to err on the small side than
on the large. Too small (positive or negative) a value of j is better
than a hole in the equation, but too large a value may be worse than a
hole.

1 Ibid., p. 405.
9.5. What are good forecasts? 

If we want to forecast from an underidentified model, we have no 
choice but to use some kind of reduced form; from an overidentified 
model, it is convenient, not compulsory, to work from a reduced form. 
The entire question in both cases is: What sort of reduced form? 
How ought we to compute its coefficients? 

To pin down our ideas, we shall consider the model By + Γz = u,
where u has all the Simplifying Properties; in addition we shall make
the covariances of the disturbances known fixed constants, possibly all
equal, so as to keep them out of the way of the likelihood function.
This way we concentrate attention on the parameters β, γ, and π and their
rival estimates. The reduced form is y = πz + v, where π = −B⁻¹Γ,
v = B⁻¹u. The reduced form contains the entire set of exogenous
variables whether the original form is exactly, over-, or underidentified.

Maximum likelihood minimizes the sum of squares of the structural
disturbances u by the βs and γs; limited information and instrumental
variables approximate this. The naive reduced form advocated by Liu
minimizes the sum of squares of the reduced-form disturbances v
by the πs (whatever these may be). Naturally, the two procedures
are not equivalent, and, naturally, the second guarantees that residuals
will be forecast with minimum variance.¹ But what is so good about
forecasting residuals with minimum variance? The forecasts themselves

1 Provided the sample and structure conform to conditions 1 to 3 of Sec. 9.4. 



in both cases are (in general) biased, but the forecasts by maximum
likelihood have the greater probability of being right.

In Fig. 18, p is the course of future events if no disturbances occur.
The curve labeled p̂ shows the (biased) probability distribution of the
full information, maximum likelihood estimate of p; it is in general
biased (Ep̂ ≠ p) but has its peak at p itself. Curve p̃ is another
maximum likelihood estimate (say, instrumental variables or limited
information); it too has a peak at p but a lower one, perhaps a different
bias Ep̃, and certainly a larger variance than p̂. The reduced-form
least squares estimate is distributed as in curve p̄; naturally it has a
smaller spread than p̂ and p̃; it may be more or less biased than either;
but its peak is off p.

Fig. 18. The properties of forecasts (abscissa: the variable and its forecasts).
p: the true value of the forecast variable under zero disturbances. p̄: reduced-
form least squares estimates. p̂: full-information maximum likelihood esti-
mates. p̃: other maximum likelihood estimates.

To put this into words: If, in the postsample year, all disturbances
happen to be zero, maximum likelihood estimates forecast perfectly,
and least squares forecast imperfectly. If the disturbances are non-
zero, both forecast imperfectly; but, on the average and in the long
run, least squares forecasts are less dispersed around their (biased)
mean.
Which criterion is more reasonable is, I think, open to debate. I
favor maximum likelihood estimates for much the same reason that I
accept the maximum likelihood criterion in the first place: If we are to
predict the future course of events, why not predict that the most
probable thing (u = 0) will happen? What else can we sanely
assume — the second most probable? On the other hand, if my job
depends on the average success of my forecasts, I shall choose the
least biased technique and disregard the highest probability of particu-
lar instances. If I want to make a showing of unswerving, unvacil-
lating steadfastness, I shall use the least squares technique on the
reduced form, even though it steadfastly throws my forecasts off the
mark in each particular instance and in the totality of instances.

Further readings 

The reference for Sec. 9.2 is H. Theil, "Estimation of Parameters of Econo-
metric Models" (Bulletin de l'Institut international de statistique, vol. 34, pt. 2,
pp. 122-120, 1954). It is full of misprints.

Extraneous estimators are illustrated in Klein, chap. 5, where he pools
time-series and cross-section data. Their statistical and common-sense diffi-
culties are discussed in Edwin Kuh and John R. Meyer, "How Extraneous
Are Extraneous Estimates?" (Review of Economics and Statistics, vol. 39,
no. 4, pp. 380-393, November, 1957).

Tinbergen, pp. 200-204, discusses the advantages and disadvantages of
working from a reduced form, but overlooks that its least squares estimation
is maximum likelihood only for an underidentified or exactly identified
model.
Ever since Haavelmo, Koopmans, and others proposed elaborate methods
for correct simultaneous estimation, naive and not-so-naive least squares has
not lacked ardent defenders. Carl F. Christ, "Aggregate Econometric
Models" [American Economic Review, vol. 46, no. 3, pp. 385-408 (especially
pp. 397-401), June, 1956], claims that least squares forecasts are likely to be
more clustered than other forecasts; and Karl A. Fox, "Econometric Models
of the U.S. Economy" (Journal of Political Economy, vol. 64, no. 2, pp. 128-
142, April, 1956), has performed simple least squares regressions using the
data and form of the Klein-Goldberger model (for reference, see Further
Readings, chap. 1). See also Carl F. Christ, "A Test of an Econometric
Model of the United States 1921-1947" (Universities-National Bureau Com-
mittee, Conference on Business Cycles, New York, pp. 35-107, 1951), with
comments by Milton Friedman, Lawrence R. Klein, Geoffrey H. Moore, and
Jan Tinbergen and a reply by Christ, pp. 107-129. In pp. 45-50 Christ
summarizes the properties of rival estimating procedures. E. G. Bennion,
in "The Cowles Commission's 'Simultaneous Equations Approach': A Sim-
plified Explanation" (Review of Economics and Statistics, vol. 34, no. 1, pp.
49-56, 1952), illustrates why least squares gives a better historical relation-
ship and better forecasts (as long as exogenous variables stay in their
historical range) than do simultaneous estimates. John R. Meyer and Henry
Laurence Miller, Jr., "Some Comments on the 'Simultaneous-equation
Approach'" (Review of Economics and Statistics, vol. 36, no. 1, February,
1954), state very clearly the different kinds of situations in which forecasts
have to be made — and to each corresponds a proper estimating procedure.

Herman Wold says that he wrote Demand Analysis (New York: John Wiley
& Sons, Inc., 1953) in large part to reinstate "a good many methods which
have sometimes been declared obsolete, like the least squares regression or
the short-cut of consumer units in the analysis of family budget data" and
to "reveal and take advantage of the wealth of experience and common sense
that is embodied in the familiar procedures of the traditional methods" (from
page x of the preface). He believes that the economy is in truth recursive
and that it can be described by recursive models whose equations, in the
proper sequence, can be estimated by least squares. His second chapter,
entitled "Least Squares under Debate" (especially secs. 7 to 9), is very far
from convincing me that he is right.


Searching for hypotheses 
and testing them 

10.1. Introduction

Crudely stated, the subject of this chapter is how to tell whether 
some variables of a given set vary together or not and which ones do so 
more than others. The problem is how to make three interrelated 
choices: (1) a choice among the variables available, (2) a choice among 
the different ways they can vary together, and (3) a choice among 
different criteria for measuring the togetherness of their variation. 
The whole thing is like a complicated referendum for simultaneously 
(1) choosing the number and identity of the delegates, (2) deciding 
whether they should sit in a unicameral or multicameral legislature, 
and (3) supplying them with rules of procedure to use when they go 
into session. 

This triple task is too much for a statistician, as it is for a citizenry: 
it wastes statistical data, as it wastes voters' time and attention. 
Just as, in practice, people settle independently, arbitrarily, and at a 
prior stage the number of chambers, the number of delegates, and the 
rules of procedure, so the statistician uses maintained hypotheses. 




For example, in the model C_t = α + γZ_t + u_t of Chap. 1, the presence
of one and not two equations, two and not four variables, all the
remaining stochastic and structural assumptions, and the requirement
for maximizing likelihood are the maintained hypotheses. Only rival
hypotheses about the true parameter values α and γ remain to be
tested. The entire field of hypothesis searching and testing consists of
variations on the above theme. The maintained hypotheses can be
made more or less liberal, or they may change roles with the questioned
hypotheses. Section 10.4 lists many specific examples.

The general moral of this chapter is this: Having used your data to
accept or reject a hypothesis while maintaining others, you are not
supposed to turn around, maintain your first decision, and test another
hypothesis with the same data. If you are interested in testing two
hypotheses from the same set of data, you must test them together.
Thus, if you want to find both the form and personnel of government
preferred by the French, you should ask them to rank on the ballot all
combinations (like Gaillard/unicameral, Gaillard/bicameral, Pinay/
unicameral, Pinay/bicameral) and to decide simultaneously who is to
lead and which type of parliament; not the man first and the type
second; not the type first and the man second.

Everything that follows in this chapter pretends that variables are
measured without error. Sections 10.2 and 10.3 introduce two new
concepts: discontinuous hypotheses and the null hypothesis. Sections
10.4 to 10.8 explore some of the commonest hypotheses considered by
econometricians, especially when they set about to specify a model.

10.2. Discontinuous hypotheses 

Consider again the simple model C_t = α + γZ_t + u_t. The rival
hypotheses here are alternative values of α and γ and may be any pair
of real numbers. This is an example of continuity.

Now consider this problem: Does x depend on y, or the other way
around? Taking the dependence (for simplicity only) to be linear and
homogeneous, the rival hypotheses here are

x_t = γy_t + u_t    versus    y_t = δx_t + v_t

The answer is yes or no; either the first or the second equation holds.
This is an example of discontinuity. However, the further problem of
the size of γ (or δ), by itself, may be a continuous hypothesis problem.
Many of my examples below (Sec. 10.4) are discontinuous. The
simple maximizing rules of the calculus do not work when there is
discontinuity, and this fact makes it very interesting.
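A tiny simulation on invented data makes the discontinuity concrete: the least squares line of x on y and that of y on x are genuinely different hypotheses, not rearrangements of one another.

```python
# The two rival regressions give different lines unless correlation is
# perfect: g (slope of x on y) never exceeds 1/d (the reciprocal of the
# slope of y on x).  Data are invented for illustration.
import numpy as np

rng = np.random.default_rng(2)
y = rng.normal(size=500)
x = 0.7 * y + 0.5 * rng.normal(size=500)

g = (x @ y) / (y @ y)    # least squares fit of x_t = g*y_t + u_t
d = (x @ y) / (x @ x)    # least squares fit of y_t = d*x_t + v_t

print(g, 1 / d)          # two different slopes in the same x-y plane
```

Accepting one equation or the other is a yes-or-no choice; only after that choice is made does estimating the size of γ (or δ) become a continuous problem.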

10.3. The null hypothesis 

In selecting among hypotheses we can proceed in two ways: (1)
compare them all to one another; (2) compare them each to a special,
simple one, called the null hypothesis (symbolized by H0). An example
of the first procedure is the maximum likelihood estimation of (α,γ)
in the model C_t = α + γZ_t + u_t, since it compares all conceivable
pairs (α,γ) in choosing the most likely among them. The other way to
proceed is somewhat as follows: select a null hypothesis, for example,
α = 3 and γ = 0.7, and accept or reject it (rejection meaning that we
accept the proposition "either α ≠ 3 or γ ≠ 0.7, or both") from
evidence in the sample. I have more to say later on about how to
select a null hypothesis and what criteria to use for accepting or
rejecting it. Meanwhile, note that the decision to proceed via null
hypothesis has nothing to do with continuity and discontinuity, though
it happens that many applications of the null hypothesis technique are
in discontinuous problems.

10.4. Examples of rival hypotheses 

Many of the examples in this section are linear and homogeneous for
the sake of simplicity only; in these cases linearity (and homogeneity) is
guaranteed not to affect the principle discussed. In other examples,
however, linearity (or homogeneity) is a rival hypothesis and thus
very much involved in the principle discussed. Now to the examples:

1. Which one variable from a given set of explanatory variables is
best? For instance, should we put income, past income, past con-
sumption, or age in a rudimentary consumption function? The rival
hypotheses here are

C_t = βY_t + u_t    C_t = γY_{t−1} + u_t    C_t = δC_{t−1} + u_t    etc.

2. Should the single term be linear or quadratic, logarithmic, etc.?
The rival hypotheses here are

C_t = βY_t + u_t    C_t = γY_t² + u_t    C_t = δ log Y_t + u_t    etc.

Note that this becomes a special case of example 1 if we agree that
Y², log Y, etc., are different variables from Y (Sec. 10.9).

3. What value of the single parameter is best? In C_t = βY_t + u_t
the rival hypotheses are different values of β, say, β = 1, β = ½,
β = ¾, and others. This, too, is a special case of example 1, because
it can be expressed as a choice among the explanatory variables Y, Y/2,
3Y/4, respectively.

4. Should there be one or more equations in the model? This ques-
tion, important when several variables are involved, lurks behind the
problems of confluence (see Sec. 6.14), but it arises even with two
variables.

The above examples generalize, naturally. For instance, the
question may be which two or which three variables to include, which
linearly, which nonlinearly, how many lags, and how far back.

5. Which variables are to be regressed on which? The rival
hypotheses are

x1 = αx2 + u    versus    x2 = βx1 + v

for two variables. If we maintain the hypothesis of three variables in
a single equation, the rival hypotheses are

x1 = αx2 + βx3 + u    versus    x2 = γx1 + δx3 + v
versus    x3 = εx1 + ζx2 + w

And, if we maintain three variables and two equations, the rival
hypotheses become

x1 = αx2 + βx3 + u        x1 = εx2 + ζx3 + w
x2 = γx1 + δx3 + v    versus    x3 = ηx1 + θx2 + t

versus    x2 = κx1 + λx3 + s
          x3 = μx1 + νx2 + r

and so on for more equations and more variables. This is typically a
discontinuous problem. It is discussed briefly in Sec. 10.8.

6. Having decided that x1 is an explanatory variable, does it help
to include x2 as well? The rival hypotheses are

y = αx1 + v    versus    y = βx1 + γx2 + w

Section 10.8 contains hints on this problem.

7. Having decided to include x1, which one other variable should be
added?

y = αx1 + βx2 + u    versus    y = γx1 + δx3 + v    etc.

Section 10.8 applies to this problem.

8. Is it better to have a ratio model or an additive one? 

c v 

- s=* a - + u versus c = &y + yn -f* v 

This is discussed in Sec. 10.10. 

9. Is it better to have a separate equation for each economic sector 
or the same equation to which is added a variable characterizing the 
sector? For example, consider the following rival demand models: 

q = αp + u    for the poor
q = βp + v    for the rich
        versus    q = γp + δy + w

where y is income. Section 10.11 discusses this problem. 

10. (A special case of the above.) Are dummy variables better than 
separate equations? 

q = αp + u    in wartime
q = βp + v    in peacetime
        versus
q = γp + δQ + w    where Q = 0 in peacetime, Q = 1 in wartime

This problem is a special case of the example discussed in Sec. 10.11. 

11. Do variables interact? That is to say, does the size of one or 
more variables fortify (or nullify) the others' separate effects? For 
instance, if being stupid and being old (the variables s and a, respec- 
tively) are bad for earning income, are stupidity and old age in com- 
bination worse than the sum of their separate effects? The rival 
hypotheses are 

y = αs + βa + u    versus    y = γs + δa + εsa + v


and can also be expressed as follows:

y = γs + δa + εsa + v        Null hypothesis: ε = 0

or as follows:

y = αs + β + u    for the young        Null hypothesis: α = γ
y = γs + δ + v    for the old                            β = δ

This case is not spelled out, but the discussion of Sec. 10.6 applies to it. 

This list is not exhaustive. And, naturally, the above questions can 
be combined into complex hypotheses. 
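As an illustration of example 11, the interaction hypothesis can be checked numerically. The sketch below is mine, not the book's: it fits y = γs + δa + εsa by least squares on invented data (all names and numbers are assumptions for illustration) and recovers the interaction coefficient ε.

```python
import numpy as np

# A sketch of the interaction test in example 11: fit y = γs + δa + εsa
# by least squares and inspect the estimate of ε.  Data are invented;
# the true model has ε = -0.3, so the fitted ε should come out near -0.3.
rng = np.random.default_rng(0)
n = 200
s = rng.normal(size=n)                      # "stupidity"
a = rng.normal(size=n)                      # "age"
y = -1.0 * s - 0.5 * a - 0.3 * s * a        # no disturbance, for clarity

X = np.column_stack([s, a, s * a])
gamma, delta, eps = np.linalg.lstsq(X, y, rcond=None)[0]
print(gamma, delta, eps)
```

With a disturbance added, one would of course judge the fitted ε against its sampling error rather than against zero directly.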

Digression on correlation and kindred concepts 

This is a good place to gather together some definitions and 
theorems and to issue some simple but often unheeded warnings. 
It is also an excellent opportunity to learn, by doing the exercises, 
to manipulate correlations and regression coefficients as well as
all sorts of moments. 

Universe and sample. Keep in mind that Greek letters refer to 
properties of the universe and that Latin letters are used to refer 
to the corresponding sample properties. 

Thus, as already explained in the Digression of Sec. 1.12, σxx,
σxy, σyy are population variances and covariances of x and y. The
corresponding sample quantities¹ are mxx, mxy, myy, the so-called
"moments from the sample means," introduced in the same
digression:

mxy = Σs (xs − x°)(ys − y°)

¹ To the population covariances σxy there correspond two types of sample
quantities: those measured from the mean of the universe,

qxy = Σs (xs − Ex)(ys − Ey)

where s runs over the sample S°; and those measured from the mean of the sample,
namely mxy. Interchanging qxy and mxy does not hurt at all, in general, when the
underlying model is linear, since mxy is an unbiased, consistent, etc., estimator of
both qxy and σxy, etc. There are difficulties in the case of nonlinear models, but
we shall not go into them here.


where s runs over the sample S°. The universe coefficient of cor-
relation ρ is defined by

ρxy = σxy/(σxx σyy)^½

and the corresponding sample coefficient r by

rxy = mxy/(mxx myy)^½
Later on we define partial, multiple, etc., coefficients of correla-
tion. In all cases, a coefficient of correlation measures the
togetherness of two and only two variables, though one or both may
be compounded of several others. This elementary fact is often
overlooked.

For the sake of symmetry in notation, when handling several
variables, we shall use x with subscripts: x1, x2, x3, etc. Then we
write simply ρ12, r12, m12 for ρ(x1)(x2), r(x1)(x2), m(x1)(x2), etc.

Both ρxy and rxy range from −1 to +1. Values very near ±1
mean that x and y have a tight linear fit like αx + βy = u, with
the residuals very small. A tight nonlinear fit like x² + y² = 1
does not yield a large coefficient of correlation ρxy. What we need
to describe this fit is ρ(x²)(y²). And similarly for relations like
αy + β log x = u or αy² + βx³ = w, we need ρ(log x)(y), ρ(x³)(y²).
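The circle example is easy to verify numerically. The sketch below is mine, not the book's: points lying exactly on x² + y² = 1 give a linear correlation of essentially zero, while the correlation of the transformed variables x² and y² is exactly −1.

```python
import numpy as np

# The circle example: points exactly on x² + y² = 1 give a near-zero
# linear correlation, but ρ(x²)(y²) is -1, because x² + y² = 1 is itself
# a perfect linear relation between x² and y².
theta = np.linspace(0.0, 2 * np.pi, 100, endpoint=False)
x, y = np.cos(theta), np.sin(theta)

def r(u, v):
    # sample correlation from moments about the sample means
    du, dv = u - u.mean(), v - v.mean()
    return (du * dv).sum() / np.sqrt((du * du).sum() * (dv * dv).sum())

print(r(x, y))        # essentially 0: the linear coefficient misses the fit
print(r(x**2, y**2))  # -1: the tight fit reappears after transforming
```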

10.5. Linear confluence 

From now on until the contrary is stated, I shall deal with linear
relations exclusively. The discussion is perfectly general for any
finite number of variables, but three are enough to capture the essence
of the problems with which we shall be dealing. Let the three variables
be

X1   number of pints of liquor sold at a ski resort in a day
X2   number of tourists present in the resort area
X3   average daily temperature

We suppose there are one or several linear stochastic relationships
among some or all of these variables. The least-squares-regression
coefficients are denoted by a's and b's, with a standard system of
subscripts.

Begin with regressions among X1, X2, X3, taken two at a time;
there are six such regressions. In (10-1) below these regressions are
arranged in rows according to the variable that is treated as if it were
dependent and in columns according to the variable treated as
independent.

X1 = a1.2 + b12X2        X1 = a1.3 + b13X3

X2 = a2.1 + b21X1        X2 = a2.3 + b23X3        (10-1)

X3 = a3.1 + b31X1        X3 = a3.2 + b32X2

In each subscript, the very first digit denotes the dependent variable. 
If there is a second digit before any dot appears, it denotes the inde- 
pendent variable to which the coefficient belongs. Digits after the 
dot (if any) represent the other independent variables (if any) present 
elsewhere in the equation. The order of digits before the dot is 
material, because it tells which variable is regressed on which. The 
order of subscripts after the dot is immaterial, because these digits 
merely record the other "independent" variables. 

The same three variables can be regressed three at a time. There 
are three such regressions: 

X1 = a1.23 + b12.3X2 + b13.2X3

X2 = a2.13 + b21.3X1 + b23.1X3        (10-2)

X3 = a3.12 + b31.2X1 + b32.1X2

As an exercise, consider the four-variable regression 

X1 = a1.234 + b12.34X2 + b13.24X3 + b14.23X4

and fill in the missing subscripts in 

X3 = a_.___ + b__.__X1 + b__.__X2 + b__.__X4

Returning to our liquor example, suppose we decide to measure the
three variables not from zero but from each one's sample mean. If
primed small letters represent the transformed variables, we know
that the a's drop out and the b's remain unchanged. This is so
because the model is linear. Our relations (10-1) and (10-2) now
become

x1′ = b12x2′    . . .    x3′ = b31.2x1′ + b32.1x2′



10.A Prove r(X1)(X2) = r(x1′)(x2′) = r12, that is to say, that correla-
tion does not depend on the origin of measurement.

10.B Prove

r12² = b12 b21

Hint: Use moments.

This relation says that the coefficient of correlation between two 
variables equals the geometric mean of the two regression slopes we 
get if we treat each in turn as the independent variable. The less 
these two regressions differ, the nearer is the correlation to +1 or — 1. 
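Exercise 10.B is easy to check with moments computed from any sample. The following sketch is mine (the data are invented for illustration): it computes the moments about the sample means and compares r12² with the product of the two slopes.

```python
import numpy as np

# Exercise 10.B on invented data: r12² equals b12·b21, the product of
# the two regression slopes, all computed from moments about the means.
rng = np.random.default_rng(1)
x2 = rng.normal(size=50)
x1 = 2.0 * x2 + rng.normal(size=50)

d1, d2 = x1 - x1.mean(), x2 - x2.mean()
m11, m22, m12 = (d1 * d1).sum(), (d2 * d2).sum(), (d1 * d2).sum()

b12 = m12 / m22              # slope of x1 regressed on x2
b21 = m12 / m11              # slope of x2 regressed on x1
r12_sq = m12**2 / (m11 * m22)
print(r12_sq, b12 * b21)     # the two numbers coincide
```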

10.6. Partial correlation 

Two factors may account for x1′, the sale of a lot of liquor: (1) there
are many people (x2′); (2) it is very cold (x3′). This relation is expressed

x1′ = b12.3x2′ + b13.2x3′        (10-3)

But the reason that (1) there are many people in the resort is (a) that
the weather is cold, and (possibly) (b) that a lot of drinking is going on
there, making it fun to be there apart from the pleasure of skiing.
This is expressed

x2′ = b21.3x1′ + b23.1x3′        (10-4)

Suppose we wanted to know whether liquor sales would be correlated
with crowds in the absence of weather variations. The measure we
seek is the partial correlation between x1′ and x2′, allowing for x3′. This
measure is symbolized by r12.3. It is interpreted as follows:
Define the variables

y1′ = x1′ − b13.2x3′        (10-5)

y2′ = x2′ − b23.1x3′        (10-6)

The y's are sales corrected for weather only and tourists corrected for
weather only. If we have corrected both for weather, any remaining 
covariation between them is due to (1) the normal desire for people to 
drink liquor (the more tourists the more liquor is sold), (2) the possi- 
bility that some tourists come to enjoy drinking rather than skiing (the 
more liquor, the more tourists), and (3) a combination of the first two.


The partial coefficient of correlation is defined by

r12.3 = m(y1′)(y2′)/[m(y1′)(y1′) m(y2′)(y2′)]^½        (10-7)
10.C Prove r21.3 = r12.3.

10.D Prove r12.3² = b12.3 b21.3. This is analogous to Exercise 10.B.
Hint: Substitute (10-6) and (10-7) into (10-3) and (10-4).
10.E Prove

r12.3 = (r12 − r13 r23)/[(1 − r13²)(1 − r23²)]^½

from definition (10-7) and Exercises 10.C and 10.D.

10.F Give a common-sense interpretation of the propositions in
the above three exercises.

10.G All this generalizes to four and more variables, but notation
gets very messy. Exercise 10.D generalizes into the proposition:
Every (partial or otherwise) coefficient of correlation equals the
geometric mean of the two relevant regression coefficients. So, for
example, r12.34² = b12.34 b21.34.

Let r stand for the matrix of all simple coefficients of correlation
rij, and let Rij stand for the minor of rij. Then Exercise 10.E is

r12.3² = R12²/(R11 R22)

and with four variables

r12.34² = b12.34 b21.34 = R12²/(R11 R22)

and so on for any number of variables, the dimension of R growing
all the while, of course.

10.H Show that r12.3² = R12²/(R11 R22) holds but collapses into an
identity when there is no third variable.
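The partial correlation can be checked two ways on any data set. The sketch below is mine, on invented data: it corrects x1′ and x2′ for x3′ by each one's simple regression on x3′ (a shortcut version of (10-5)-(10-6)), correlates the corrected variables, and compares with the closed formula of Exercise 10.E; the two routes agree.

```python
import numpy as np

# r12.3 two ways: (i) correlate the weather-corrected variables;
# (ii) the closed formula of Exercise 10.E.  Data are invented.
rng = np.random.default_rng(2)
x3 = rng.normal(size=300)                              # temperature
x2 = -1.5 * x3 + rng.normal(size=300)                  # tourists
x1 = 0.8 * x2 - 0.4 * x3 + rng.normal(size=300)        # liquor sales
x1, x2, x3 = (v - v.mean() for v in (x1, x2, x3))      # departures from means

def r(u, v):
    return (u * v).sum() / np.sqrt((u * u).sum() * (v * v).sum())

y1 = x1 - (x1 * x3).sum() / (x3 * x3).sum() * x3       # corrected for weather
y2 = x2 - (x2 * x3).sum() / (x3 * x3).sum() * x3
r12_3_resid = r(y1, y2)

r12, r13, r23 = r(x1, x2), r(x1, x3), r(x2, x3)
r12_3_formula = (r12 - r13 * r23) / np.sqrt((1 - r13**2) * (1 - r23**2))
print(r12_3_resid, r12_3_formula)
```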

10.7. Standardized variables 

Let us now measure X1, X2, X3 not only as departures x1′, x2′, x3′ from
their sample means but also in units equal to the sample standard
deviation of each. So transformed, the variables are called just
x1, x2, x3.

This step is useful in bunch map analysis (see Sec. 10.8). When 
this is done, nothing happens to either the population or the sample 
correlation coefficients, but the regression parameters between the 
variables do change. 


10.I Prove m(x1)(x2) = [m(x1′)(x1′) m(x2′)(x2′)]^−½ m(x1′)(x2′).

10.J Prove that r(x1)(x2) = r(x1′)(x2′) = r12 by using Exercises 10.B
and 10.C.

10.K Denote the regression coefficients among x1′, x2′, x3′ by the
letter b and the corresponding coefficients among the standardized
variables x1, x2, x3 by the letter a, with appropriate subscripts. Inter-
pret a12.3, a21.3, a31.2; show that they differ in meaning from a1.23, a2.13,
a3.12, respectively.

10.L Show that a12 = b12(m22/m11)^½ and, in general, that
aij.k = bij.k(mjj/mii)^½.

10.M Show, by using Exercise 10.L, that rij.k² = aij.k aji.k.

10.N Show that r12 = a12. This is a very important property,
which says that regression and correlation coefficients are identical
for standardized variables.

10.O Let x1″ = (X1 − EX1)(σ11)^−½. Prove ρ(x1″)(x2″) = ρ(x1)(x2), and
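Exercise 10.N in particular can be verified on any sample. The sketch below is mine, on invented series: after standardizing to mean zero and sample standard deviation one, the regression slope of x1 on x2 equals the correlation r12.

```python
import numpy as np

# Exercise 10.N on invented data: for standardized variables the
# regression slope a12 and the correlation r12 are the same number.
rng = np.random.default_rng(3)
X2 = rng.normal(size=80)
X1 = 3.0 * X2 + rng.normal(size=80)

def standardize(v):
    v = v - v.mean()
    return v / np.sqrt((v * v).mean())      # divide by the sample s.d.

x1, x2 = standardize(X1), standardize(X2)
a12 = (x1 * x2).sum() / (x2 * x2).sum()     # regression slope of x1 on x2
r12 = (x1 * x2).sum() / np.sqrt((x1 * x1).sum() * (x2 * x2).sum())
print(a12, r12)                             # identical
```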

10.8. Bunch map analysis 

Bunch maps are mainly of archaeological or antiquarian interest. 
They seem to have gone out of fashion. Beach (pp. 172-175) gives 
an excellent account of them which I shall not repeat here. I shall 
merely discuss necessary and sufficient conditions under which bunch 
maps help to accept or reject hypotheses. 

Turn to the example of liquor sales, skiers, and cold weather in
Sec. 10.5. Let x1, x2, x3 be the three standardized variables. Let their
correlation coefficients be


      | 1    r12   r13 |       | 1     0.5   0.2 |
r  =  | r21  1     r23 |   =   | 0.5   1     0.8 |
      | r31  r32   1   |       | 0.2   0.8   1   |

Compute the least squares regressions of all normalized variables, two
at a time:

x1 = a12x2        x1 = a13x3

x2 = a21x1        x2 = a23x3        (10-8)

x3 = a31x1        x3 = a32x2

and then three at a time:

x1 = a12.3x2 + a13.2x3
x2 = a21.3x1 + a23.1x3        (10-9)
x3 = a31.2x1 + a32.1x2


Construct now the unit squares, shown in Fig. 19, where 0 marks the
origin. In each block the horizontal axis corresponds to the inde-
pendent variable, and the vertical to the dependent. The labels
below the squares show which is which.

Refer now to the first equation x1 = a12x2 in (10-8). From Exercise
10.N, x1 = r12x2. Imagine a unit variation in the independent variable
x2; then the corresponding variation in x1, according to this equation,
is a12. Plot the point (1, a12) in the first block of squares. Then go
to the symmetrical equation x2 = a21x1, make x1 vary by Δx1 = 1, and
plot the resulting point (a21, 1) in the same block. In a similar way
fill out the top row of Fig. 19, drawing the beams from the origin.

In (10-9), first consider the variation in x1 resulting from variations
in x2, other things being equal. We get three different answers from
(10-9), one per equation:

Δx1 = a12.3 Δx2

Δx1 = (1/a21.3) Δx2        (10-10)

Δx1 = −(a32.1/a31.2) Δx2


Digressing a little, I state without proof that each of these ratios
can be written in terms of the minors Rij of the matrix r of simple
correlation coefficients.

First place: dependent variable
Second place: independent variable
After dot: variable allowed for
Fig. 19. Bunch maps.

Therefore we get from (10-10) the three statements that Δx1:Δx2 is
proportional to R12:R11, to R22:R21, and to −R32:R31. In the figure
this is depicted, respectively, by the beams marked (12.3). The three
regressions in general conflict both with regard to the slope and with
regard to the length of the beams.

Derive the corresponding relations for Δx1:Δx3 and Δx2:Δx3. These
results are plotted in the last two panels of Fig. 19.
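The three conflicting answers for Δx1:Δx2 can be computed directly from the regression coefficients of (10-9). The sketch below is mine, using the correlation matrix given in the text; the three slopes come out far apart, which is exactly the conflict the beams are meant to display.

```python
# Beam slopes Δx1/Δx2 for the (12.3) panel, from the three regressions
# (10-9) on standardized variables; r12, r13, r23 are the values in the text.
r12, r13, r23 = 0.5, 0.2, 0.8

a12_3 = (r12 - r13 * r23) / (1 - r23**2)    # x1 regressed on x2, x3
a21_3 = (r12 - r13 * r23) / (1 - r13**2)    # x2 regressed on x1, x3
a31_2 = (r13 - r12 * r23) / (1 - r12**2)    # x3 regressed on x1, x2
a32_1 = (r23 - r12 * r13) / (1 - r12**2)

slopes = [a12_3, 1 / a21_3, -a32_1 / a31_2]  # one answer per equation
print(slopes)   # the three beams disagree: the regressions conflict
```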




10.P Plot the bunch maps for

      |  1     −0.6   −0.1 |
ρ  =  | −0.6    1      0.6 |
      | −0.1    0.6    1   |

Scanning Fig. 19 is supposed to tell us (1) which two variables to
regress if we want to stick to two of the given three, and (2) whether a
third variable is superfluous, useful, or detrimental in some very loose
sense.

What do we look for in Fig. 19? Three things: (1) opening or 
closing of bunch maps as you go from the upper to the lower panel, 
(2) shortening of the beams, and (3) change of direction of the bunches. 
There is no simple intuitive way to interpret the many combinations 
of 1, 2, and 3; this is the main reason why statisticians have abandoned 
bunch maps. 

The examples that follow far from exhaust the possibilities. The 
moral of these examples is: To interpret the behavior of the bunch 
maps, you must translate them into correlation coefficients r and 
try to interpret what it means for the coefficients to be related in one 
way or another. But one might as well start with the correlation 
coefficients, bypassing the bunch maps altogether. 

Example 1. The vanishing beam 

What can we infer if beam R12/R11 shrinks in length? Take the
extreme case R12 = 0 and R11 = 0. These imply r12 = r13r23 and
r23² = 1, which, in turn, imply r23 = ±1 and r12 = ±r13. Let us
restrict the illustration to the plus-sign case r23 = 1, r12 = r13.

The meaning of r23 = 1 is that x2 and x3 in the sample, uncorrected
for variations in x1, are indistinguishable variables. Relation r12 = r13
shows that, if x2 and x3 were corrected for x1, the corrections would be
identical; the resulting corrected variables are also identical. This
can also be seen from the fact that in these circumstances r23.1 equals 1.
All this would, of course, be detectable from the top level of Fig. 19,
signifying that three variables are too many and that any two are
nearly as good as any other two.


Example 2. The tilting beam 

What does it mean if beam R12/R11 tilts toward one axis without
shrinking in length? For instance, let R12 ≠ 0 and R11 = 0. This
implies again that r23 = ±1, that is to say, x2 = ±x3. Taking again
just the + case, this signifies that the uncorrected x2 and x3 are in
perfect agreement. However, R12 = r12 − r13 ≠ 0, or r12 ≠ r13; take
the case r12 < r13 for the sake of the illustration. The inequality
r12 ≠ r13 suggests that the corrections of x2 and x3 to take account of
variations in x1 will be different corrections and will upset the perfect
harmony. This can be seen again from

r23.1 = R23/(R22 R33)^½ = (r23 − r12r13)/[(1 − r12²)(1 − r13²)]^½
     = (1 − r12r13)/[(1 − r12²)(1 − r13²)]^½ ≠ 1

In terms of our example, there is a spurious perfect correlation between
x2, the number of skiers, and x3, the weather. It is spurious because
some skiers come to enjoy not the weather but the liquor. However,
liquor sales respond less perfectly to tourist numbers than to weather;
that is, r12 < r13. Therefore, if you take into account the fact that
liquor too attracts skiers, the weather is not so perfectly predictable a
magnet for skiers as you might have thought by looking at r23 = 1.
The hypothesis accepted in this case is: Liquor is significant and ought
to be introduced in a predictive model.


10.Q Show that, if beam a12.3 has the same slope as a12, this implies
a12.3 = r12 and also r13 = r23 and, hence, that all three beams of the
bunch map come together. Interpret this.

10.R Interpret the situation where all three beams R12/R11, R22/R21,
and −R32/R31 have the same slope. Must they necessarily have the
same length? Must the common slope necessarily equal a12?

10.9. Testing for linearity 

If the rival hypotheses are

y = βx + u    versus    y = γx² + v

the matter is quickly settled by comparing the correlation coefficients
rxy with r(x²)(y). Things become complicated if the quadratic function
contains a linear term, because the function y = γx² + δx + v con-
tains the linear function y = βx + u as a special case; therefore, we
would expect the correlation to be improved by adding a higher term.
Thus, for any fit giving estimates β̂, γ̂, and δ̂, r(y)(γ̂x²+δ̂x) is bound to be
greater than r(y)(β̂x). Correlation coefficients do not give the best tests of
linearity. Common sense suggests something simpler and more
direct.
The curves in Fig. 20a represent the two rival hypotheses. If the








Fig. 20. Tests of nonlinearity. 

quadratic is true but we fit a straight line, then the computed residuals 
from the fitted straight line will be overwhelmingly positive for 
some ranges of x and overwhelmingly negative for other ranges. These 
ranges are defined in terms of the intersections of the rival curves. 
Somewhere left of A most residuals are negative, and to the right, most
are positive. Complicated numerical formulas for testing nonlinearity
are nothing but algebraic translations of this simple test.

All this generalizes quite readily. For instance, the test of hypothesis
y = αx + u versus a cubic is sketched in Fig. 20b; a quadratic versus
a cubic in Fig. 20c. And it generalizes into several variables x, y, z, etc.


In each case the test consists in dividing the range of x into several
equal parts P1, P2, . . . , as shown in either Fig. 21a or 21b. In each
part compute the average straight-line regression residual av u. If this
tends to vary systematically (with a trend or in waves), the relationship
is nonlinear.
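The interval test is a few lines of arithmetic. The sketch below is mine, on invented data generated from a quadratic: the straight-line fit leaves part averages of the residuals that swing systematically from positive to negative and back, instead of hovering near zero.

```python
import numpy as np

# The interval test: fit a straight line, then average the residuals over
# equal parts of the range of x.  Data are invented from a quadratic, so
# the part averages should wave systematically rather than hover near zero.
rng = np.random.default_rng(4)
x = rng.uniform(0.0, 4.0, size=400)
y = x**2 + rng.normal(scale=0.5, size=400)       # true relation: quadratic

b = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean())**2).sum()
a = y.mean() - b * x.mean()
u = y - (a + b * x)                              # straight-line residuals

edges = np.linspace(0.0, 4.0, 5)                 # four parts P1..P4
av_u = [u[(x >= lo) & (x < hi)].mean()
        for lo, hi in zip(edges[:-1], edges[1:])]
print(av_u)   # positive, negative, negative, positive: a systematic wave
```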

When we have three or more variables x, y, z and want to test
linearity versus some other hypothesis, we have to extend to two
dimensions the technique of Fig. 21. Let the rival hypotheses be

x = α + βy + γz + u    versus    x = δ + εy + ζy² + ηz + θz² + ιyz + v

In the yz plane the intersection of these two surfaces projects a 
hard-to-solve-for and messy curve KLMNP (see Fig. 22a). Instead of 
obtaining it, let us see whether we can sketch it vaguely. Divide the
sample range of y and z into chunks, as shown in the figure (they do not


Fig. 21. The interval test.

need to be square, and they may overlap in a systematic way analogous
to Fig. 21b). In each chunk, compute the average linear residual av u,
and see whether a pattern emerges. By drawing approximate contour 
lines according to the local elevation of av u, we may be able to detect 
mountains or valleys, which tell us that the true relationship is non- 
linear. Something analogous can be done when both rival hypotheses 
are nonlinear. 

10.10. Linear versus ratio models 

The rival hypotheses here are 

c/n = α + β(y/n) + u    versus    c = γ + δy + εn + v

where u and v have the usual properties to ensure that least squares 
fits are valid. 



If the ratio model is the maintained hypothesis, then we would
expect av u to be constant over successive segments of the axis y/n.
Translated into the projection on the yn plane, this means that av u
should be constant in the successive slices shown in Fig. 22b. For the
linear model, av v should be constant in the squares of Fig. 22c. In
general, one criterion will be satisfied better than the other and will
plead for the rejection of the opposite hypothesis. If both criteria are












Poor    Rich

Fig. 22

substantially satisfied, then there is no problem of choosing, because
both formulations say that c, y, and n are related linearly and homo-
geneously (γ = 0). One formulation might possibly be more efficient
than the other for reasons of "skedasticity" (compare Sec. 2.15).

10.11, Split sectors versus sector variable 

The rival hypotheses here are whether the demand for, say, sugar 
should be estimated for all consumers as a linear function of price and 


income q = γp + δy + w (where the price paid is uncorrelated with
income) or should be split into several demand functions q = αp + u,
q = βp + v, etc., one for each income class, on the ground that price
means more to the poor than to the rich.

For illustration it is enough if we have just two income classes, the
rich and the poor, corresponding to, say, y = 10, y = 1. Nothing
essential would be added if y were taken as a continuous variable.

As in Fig. 22c, construct a grid for the sample range of variables y
and p. If av w is constant, the single equation q = γp + δy + w
is good enough and, moreover, we have α = β in the alternative
hypothesis. If, however, the second hypothesis is correct, not only
will α̂ be very different from β̂, but av w will display contours like
those of Fig. 22d.
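The split-sector comparison is easy to carry out. The sketch below is mine, on invented data in which price truly matters more to the poor: the two split-sample slopes α̂ and β̂ come out far apart, which is the signal against the single pooled equation.

```python
import numpy as np

# Split-sector sketch: two income classes (poor y=1, rich y=10) with
# genuinely different price slopes.  Fitting q on p within each class
# recovers slopes near -2 and -0.5, so α ≠ β and the split model wins.
rng = np.random.default_rng(5)
n = 200
y = np.where(np.arange(n) < n // 2, 1.0, 10.0)   # income class
p = rng.uniform(1.0, 2.0, size=n)                # price, uncorrelated with y
q = np.where(y == 1.0, -2.0, -0.5) * p + 0.3 * y + rng.normal(scale=0.1, size=n)

def slope(xs, qs):   # least-squares slope, intercept allowed for
    return ((xs - xs.mean()) * (qs - qs.mean())).sum() / ((xs - xs.mean())**2).sum()

alpha = slope(p[y == 1.0], q[y == 1.0])    # demand slope for the poor
beta = slope(p[y == 10.0], q[y == 10.0])   # demand slope for the rich
print(alpha, beta)                         # near -2 and -0.5
```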

10.12. How hypotheses are chosen 

In this section I am neither critical, nor constructive, nor original.
I think it proper to look at the way that statistical hypothesis making
and testing take place around us.

The econometrician, geneticist, or other investigator usually begins 
with (1) prejudices instilled from previous study, (2) vague impressions, 
(3) data, (4) some vague hypotheses. 

He then casts a preliminary look at the data and informally rejects 
some because they represent special cases (war years, for instance, 
or extremely wealthy people) and others because they do not square 
with the vague hypotheses he holds. He uses the remaining data 
informally to throw out some of his hypotheses, from among those that 
are relatively vague and not too firmly grounded in prejudice. 

At this stage he may prefer to scan the data mechanically, say, by 
bunch maps, rather than impressionistically. Mechanical prescreen- 
ing is used (1) because the variables are many, and the unaided eye is 
bewildered by them, and (2) because the research worker is chicken- 
hearted and distrusts his judgment. Logically, of course, any 
mechanical method is an implicit blend of theory and estimating 
criteria; but, psychologically, it has the appearance of objectivity. 
The good researcher knows this, but he too is overwhelmed by the 
illusion that mechanisms are objective. 

Having done all this, the investigator at long last comes to specifica-
tion (as described in Chap. 1); he then estimates, accepts, rejects, or
samples again.

This stage-by-stage procedure is logically wrong, but economically 
efficient, psychologically appealing, and practically harmless in the 
hands of a skilled researcher with a feel for his area of study. 

Instead of proceeding stage by stage, is there a way to let the facts 
speak for themselves in one grand test? The answer is no. We must 
start with some hypothesis or we do not even have facts. True, 
hypotheses may be more or less restrictive. But the less restrictive the 
hypotheses are, the less a given body of data can tell us. 

Further readings 

For rigorous treatment of the theory of hypothesis testing, one needs to 
know set theory and topology. Klein's discussion, pp. 56-62, gives a good 
first glimpse of this approach and a good bibliography, p. 63. 

For treatment of errors in the variables, consult Trygve Haavelmo, "Some
Remarks on Frisch's Confluence Analysis and Its Use in Econometrics,"
chap. V in Koopmans, pp. 258-265.

Beach discusses bunch maps and the question of superfluous, useful, or 
detrimental variables, pp. 174-175. Tinbergen, pp. 80-83, shows a five- 
variable example. 

Cyril H. Goulden, Methods of Statistical Analysis, 2d ed., chap. 7 (New
York: John Wiley & Sons, Inc., 1952), gives an elementary discussion of
ρ and the sample properties of its estimate r.


Unspecified factors 

11.1. Reasons for unspecified factor analysis 

Having specified his explanatory variables, the model builder fre- 
quently knows (or suspects) that there are other variables at work that 
are hard to incorporate. 

1. The additional variable (or variables) may be unknown, like the 
planet Neptune, which used to upset other orbits. 

2. The additional variable may be known but hard to measure. For 
instance, technological change affects the production function, but 
how are we to introduce it explicitly? 

There are two ways out of this difficulty: splitting the sample, and 
dummy variables. When we split the sample we fit the production 
function to each fragment independently in the hope that each frag- 
ment is uniform enough with regard to the state of technology and yet 
large enough to contain sufficient degrees of freedom to estimate the 
parameters. The technique of dummy variables does not split the 
sample, but instead introduces a variable that takes on two and only 
two values or levels: 0 when, say, there is peace, and 1 when there is war.
Phenomena that are capable of taking on three or more distinct states 
are not suited to the dummy variable technique. For instance, it 



would not do to count 0 for peace, 0.67 for cold war, and 1 for shooting
war, because this would impose an artificial metric scale on the state 
of world politics which would affect the parameters attached to honest- 
to-goodness, truly measurable variables. No artificial metric scale is 
introduced by the two-level dummy variable. 

3. The additional factors at work may be a composite of many
factors, too many to include separately and yet not numerous enough
or independent enough of one another to relegate to the random term
of the equation.

4. The additional variable may be known and measurable, but
we may not know whether to include it linearly, quadratically, or
otherwise.

5. The additional variable may be known, measurable, etc., but 
not simple to put in. To admit a wavy trend line, for instance, eats 
up several degrees of freedom. 

In such cases the unspecified variable technique comes to our rescue, 
at a price, because it sometimes requires special knowledge. In the 
illustration of Sec. 11.2, for instance, to estimate a production function 
that shifts with technological change, time series are not enough. The 
data must contain information about inputs and outputs broken down, 
say, by region, or in some dimension besides chronology. 

11.2. A single unspecified variable 

This section is based on the technique developed by C. E. V. Leser¹
in his study of British coal mining during 1943-1953, years of rapid
technological change, nationalization, and other disturbances.

He fitted the function Prt = gtLrt^α Crt^β, where P is product, L is labor,
C is capital, gt is the unspecified impact of technology, r and t are
regional and time indices, and α, β are the unknown parameters.

Here for exposition's sake, I shall linearize his model and drop the 
second specified variable. Consider then 

Prt = gt + αLrt + urt        (11-1)

¹ C. E. V. Leser, "Production Functions and British Coal Mining," Econometrica,
vol. 23, no. 4, pp. 442-446, October, 1955.


The following assumptions are made: 

1. Technology affects all regions equally in any moment of time. 

2. The same production function applies to all regions. 

3. The random term is normal, with a period mean

(1/R) Σr urt

equal to zero, and a regional mean

(1/T) Σt urt

also equal to zero. We shall now use the notation av[r]urt and av[t]urt
for expressions like the last two.¹

Now, keeping time fixed at t = 1, let us average inputs and
outputs over the R regions. From (11-1) we get, remembering that
av[r]gt = gt,

av[r]Pr1 = g1 + α av[r]Lr1        (11-2)

And, by subtracting (11-2) from (11-1), we get the following relation
between P′r1 and L′r1, which are product and labor measured from their
mean values of period 1:

P′r1 = αL′r1 + u′r1        (11-3)

Do the same for t = 2, . . . , T and then maximize the likelihood of
the sample. Under the usual assumptions, this is equivalent to
minimizing the sum of squares

Σr Σt (P′rt − αL′rt)²

The resulting estimate of α is

α̂ = mP′L′/mL′L′        (11-4)

In this expression the moments are sums running over all regions and
time periods.

1 Read "average over the r regions," "average over the t years." 


Having found α̂, we can go back to (11-2) to compute the time path
gt of the unspecified variable, technology.
The method I have just outlined has several advantages:

1. It uses R × T observations (a large number of degrees of freedom)
in estimating the parameter α.

2. Unlike split sampling, it obtains a single parameter estimate for
all regions and periods.

3. It yields us an estimate of the unspecified variable (technological
change), if it is the only other factor at work.

4. This technological change does not have to be a simple function
of time. It may be secular, cyclical, or erratic; it can be linear,
quadratic, or anything else.

5. The method estimates, in addition to technology, the effects of
any number of other unspecified variables (such as inflation, war,
nationalization) which at any moment may affect all regions equally.


The chief disadvantage of the technique is that the unspecified
variable gt has to be introduced in a manner congenial to the model,
that is to say, as a linear term in a linear model, as a factor in Leser's
logarithmic model, and so forth; otherwise it would not drop out,
as in (11-3), when we express the specified variables as departures
from their average values.

For the unspecified variable technique to be successful it is necessary 
that the data come classified in one more dimension than there are 
unspecified variables. Thus P and L must have two subscripts. 
Moreover, each region must have coal mines in each time period. 1 
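The whole procedure of (11-1)-(11-4) fits in a few lines. The sketch below is mine, on simulated data (the dimensions, the erratic path gt, and the true α = 0.7 are all invented): demeaning P and L within each period removes gt, the pooled moments give α̂, and (11-2) then backs out the path of technology.

```python
import numpy as np

# Sketch of (11-1)-(11-4) on simulated data: the unspecified g_t drops out
# when P and L are measured as departures from each period's regional mean,
# and α is then estimated from the pooled moments, as in (11-4).
rng = np.random.default_rng(6)
R, T, alpha_true = 6, 10, 0.7
g = np.cumsum(rng.normal(size=T))                # erratic technology path
L = rng.uniform(10.0, 20.0, size=(R, T))         # labor, by region and year
P = g[None, :] + alpha_true * L + rng.normal(scale=0.1, size=(R, T))

Pd = P - P.mean(axis=0, keepdims=True)           # P'_rt: departures by period
Ld = L - L.mean(axis=0, keepdims=True)           # L'_rt
alpha_hat = (Pd * Ld).sum() / (Ld * Ld).sum()    # (11-4): m_P'L' / m_L'L'

g_hat = P.mean(axis=0) - alpha_hat * L.mean(axis=0)   # back out g_t via (11-2)
print(alpha_hat)                                 # close to the true 0.7
```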

11.3. Several unspecified variables

Imagine now that we wish to explain retail price P in terms of unit 
cost C, distance or location D, monopoly M, and the general level of 
inflation J. Cost is the specified variable, and location, monopoly, 
and inflation are left unspecified for one or another of the reasons I 

1 There are methods for treating lacunes, or missing data, but these are rather 
elaborate and will not be discussed in this work. The usual way to treat a lacune 
is by pretending it is full of data that interpolate perfectly in whatever structural 
relationship is finally assigned to the original data. 


recounted in Sec. 11.1. The model, assumed to be linear, is 

Pfirt = Mi + Dr + Jt + αCfirt + ufirt        (11-5)

where the subscripts f, i, r, t express firm, industry, region, and time.
The model, as written, maintains that the degree of monopoly is a
property of the industry only, not of the region or of the inflationary
situation or of interactions among the three. Similarly, inflation is
solely a function of the time and not of the degree of monopoly and 
location of industry. Note again that the data have to come classified 
in one more dimension than there are unspecified variables. Thus P 
and C must have four subscripts, one for each of the unspecified 
variables, plus an extra one (firm /). Moreover, unless we have 
lacunes, each firm must be present in each industry, region, and time 
period. The firms of Montgomery Ward and Sears Roebuck would 
do, 1 and the industries they enter can be, say, watch retailing, tire 
retailing, clothing retailing, etc. 

In that case, α is estimated analogously to (11-4) by α̂ = mP′C′/mC′C′,
where the moments are sums running over f, i, r, t. Having esti-
mated α, we can now define a new variable S, the price-cost spread
S = P − α̂C. The model is now

Sfirt = Mi + Dr + Jt + vfirt        (11-6)

Estimating M, D, and J is the so-called problem of linear factor analysis.

11.4, Linear orthogonal factor analysis 

Linear factor analysis attempts to explain the spread S as an additive 
resultant of two or more separate factors; in the example of (11-6) 
there are three factors: monopoly, region, and inflation. 

Nothing essential is lost if we confine ourselves to two factors, say, 
monopoly and inflation, and consider the simpler model 

$S_{fit} = M_i + J_t + v_{fit}$     (11-7)

To grasp its essence, imagine that there are no random disturbances
(v = 0) and that there is only one firm, which sells three products

1 Provided both exist in all time periods, regions, and industries included in 
the sample. 



(tires, watches, clothes) over 5 years. Observations can be put in a 
3-by-5 table or matrix whose rows correspond to the commodities and 
columns to the years: 

S =

    S_11  S_12  S_13  S_14  S_15
    S_21  S_22  S_23  S_24  S_25
    S_31  S_32  S_33  S_34  S_35

Factor analysis seeks to express this table as the sum of two tables
M and J of similar dimensions, the first with constant rows and the
second with constant columns:

M =

    M_1  M_1  M_1  M_1  M_1
    M_2  M_2  M_2  M_2  M_2
    M_3  M_3  M_3  M_3  M_3

J =

    J_1  J_2  J_3  J_4  J_5
    J_1  J_2  J_3  J_4  J_5
    J_1  J_2  J_3  J_4  J_5

In a practical problem this cannot be done exactly, particularly if 
several firms are involved. This is the familiar problem of conflicting 
observations, which is treated in Sec. 7.2. In practice, some com- 
promise is found which gives the M and J that "fit best" the observa- 
tions S. 

A graphic way to express the problem of factor analysis is the 
following. You are given a rectangular piece, say, 3 by 5 miles, of a 
topographical map with contour lines showing the elevation at various 
spots. You are supposed to find a landscape profile running from 
north to south and another one running from east to west with the 
property that, if you slide the bottom of the first perpendicularly along 
the humps and bumps of the second, the top crests describe the original 
surface of the 3-by-5 map. The same happens if you interchange the 
roles of the two profiles. The two profiles are kept always perpen- 
dicular to each other; and this is why the literature calls the two fac- 
tors M and J orthogonal (that is to say, right-angled) . (See Fig. 23.) 

Computing differences among the various entries in M and J is a
simple matter under the usual assumptions. Again, we minimize the
sum of squared residuals

$\sum_{f,i,t} (S_{fit} - M_i - J_t)^2$

with respect to $M_1$, $M_2$, $M_3$, $J_1$, $J_2$, $J_3$, $J_4$, $J_5$. Thus the solution for
$\hat M_1$ is

$\hat M_1 = \frac{1}{FT}\sum_f \sum_t S_{f1t} - \frac{1}{T}\sum_t \hat J_t$     (11-8)

and that of $\hat J_3$ is

$\hat J_3 = \frac{1}{FI}\sum_f \sum_i S_{fi3} - \frac{1}{I}\sum_i \hat M_i$     (11-9)

where F, I, T are the total number of firms, industries, and time
periods, respectively. Note that, to estimate the degree of monopoly
in the first industry, we need knowledge of inflation in all years; to
estimate inflation in year 3, we need measures of monopoly for all



y*K v/\y-"~~ 

North 3 miles South West 

Fig. 23 

5 miles 


industries. Equation (11-8) can be rationalized as follows: to estimate 
the effect of monopoly in the first industry, disregard the price-cost 
spread in all other industries, and compute the over-all (firm-to-firm 
and period-to-period) average spread in industry 1:

$\frac{1}{FT}\sum_f \sum_t S_{f1t}$

From this deduct the average inflationary impact

$\frac{1}{T}\sum_t \hat J_t$

What is left is the monopoly impact.
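For a single firm the estimates of Sec. 11.4 can be sketched as row and column averages (an illustrative sketch with an invented 3-by-5 spread table; since the factors are identified only up to an additive constant, the J's are normalized here to average zero, a convention of mine, not the book's):

```python
# Sketch of linear orthogonal factor analysis for one firm (F = 1):
# M_i is estimated from the row average of S, J_t from the column
# average minus the grand mean, so that av(J) = 0 by normalization.
def factor_estimates(S):
    I, T = len(S), len(S[0])
    grand = sum(sum(row) for row in S) / (I * T)
    M = [sum(row) / T for row in S]              # row means = M_i + av(J)
    J = [sum(S[i][t] for i in range(I)) / I - grand for t in range(T)]
    return M, J

# Invented additive table S[i][t] = M_i + J_t, with M = (5, 3, 1)
# and J = (2, 0, -1, 1, -2) already summing to zero.
S = [[5 + j for j in (2, 0, -1, 1, -2)],
     [3 + j for j in (2, 0, -1, 1, -2)],
     [1 + j for j in (2, 0, -1, 1, -2)]]
M, J = factor_estimates(S)
```

Because the table is exactly additive, the estimates reproduce the generating factors exactly; with conflicting observations they would only "fit best."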


11.5. Testing orthogonality 

It is entirely possible for inflation's impact on the price-cost spread 
to be related to monopoly. Indeed, there is evidence from the Second
World War that price control was more successful in monopolistic 
industries (and firms) than in competitive ones. A monopolist or 
monopolistic competitor is recognized and remembered by the public. 
If he takes advantage of inflation, he may lose goodwill or perhaps be 
sued by the government as an example to others. If monopoly and 
inflation interact in this way or in some other way, the linear model 
(11-7) is not applicable. Because it is simple, however, we may adopt 
it as our null hypothesis, fit it, and look for a systematic pattern of
discrepancies as a test of the hypothesis.

The formulas for doing this are rather complicated expressions, 
which I shall not bother to state. Intuitively the test is quite simple. 
If, by rearranging whole rows and whole columns, table S can be made
to have its highest entry in the upper left-hand corner, its smallest
entry in the lower right-hand corner, with each row and column
stepping down by equal amounts, the null hypothesis holds. For example,

S =

    12  15  11
    14  17  13

can be rearranged thus:

S' =

    17  14  13
    15  12  11

Note that

S' =

    2  2  2     15  12  11
    0  0  0  +  15  12  11      (11-10)

To state the same test in terms of our geographic profiles of Sec. 11.4: 
Cut up the original map into north-south strips, rearrange, and then 
glue them together. Then cut the resulting map into east-west strips 
and rearrange these. Should this procedure produce a map of a terri- 
tory (1) sloping from its northwest corner down to its southeast corner, 
(2) with neither local hills nor saddle points, and (3) such that, if you 
stand anywhere on a given geographical parallel and take one step 
south, you step down by an equal amount, say, 3 feet, and (4) such 


that, likewise, if you start from any point on a fixed meridian, one 
eastward step loses the same elevation, say, 2.1 feet, then the factors 
are orthogonal. 

In arithmetical terms, having estimated $\hat M_1, \hat M_2, \ldots, \hat J_5$, rearrange
the rows and columns so that the most monopolistic industry occupies 
the top row and the most inflationary year occupies the leftmost 
column. Compute the residuals 

$\hat v_{fit} = S_{fit} - \hat M_i - \hat J_t$

and place their sums

$\sum_f \hat v_{fit}$
in the appropriate row and column. Any run, or large local concentration,
of mostly positive or mostly negative residuals is evidence that
monopoly and inflation have interaction effects (are not orthogonal).
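The residual check just described can be sketched as follows (illustrative data, and helper names of my choosing):

```python
# Sketch of the orthogonality check of Sec. 11.5: compute residuals
# v = S - M_i - J_t and look for runs of same-signed residuals, which
# suggest that the two factors interact (are not orthogonal).
def residuals(S, M, J):
    return [[S[i][t] - M[i] - J[t] for t in range(len(J))]
            for i in range(len(M))]

def longest_sign_run(values):
    best = run = prev = 0
    for v in values:
        sign = (v > 0) - (v < 0)
        run = run + 1 if sign != 0 and sign == prev else (1 if sign != 0 else 0)
        prev = sign
        best = max(best, run)
    return best

M, J = [2.0, 0.0], [3.0, 1.0, 0.0]
additive = [[M[i] + J[t] for t in range(3)] for i in range(2)]
interacting = [[additive[i][t] + i * t for t in range(3)] for i in range(2)]
clean = residuals(additive, M, J)        # all zeros under orthogonality
dirty = residuals(interacting, M, J)     # second row shows a positive run
```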

11.6. Factor analysis and variance analysis 

Unspecified factor analysis, the technique explained in this chap- 
ter, should be carefully distinguished from variance analysis (and 
from factor analysis in the principal components sense of the term).
Both techniques make use of a row-column classification, and both 
usually proceed on the null hypothesis that rows and columns do 
not interact. But here the similarities end. Factor analysis meas- 
ures the row and column effects for each row and column, i.e., it 
computes the unspecified variable. Variance analysis attributes 
various percentages of total variance 1 to differences among all rows, to 
differences among all columns, and the remainder to chance. Factor 
analysis ends with I + T estimates $\hat M_1, \hat M_2, \ldots, \hat M_I, \hat J_1, \hat J_2, \ldots, \hat J_T$.
Variance analysis ends with three percentages expressing row variance, 
column variance, and unexplained variance in terms of total variance. 

¹ Total variance in terms of the example, model (11-7), is

$\sum_{f,i,t} (S_{fit} - \text{av } S)^2$

where av S is $\text{av}_{fit}\,S_{fit}$, or the average spread over the entire sample.


In the course of analysis of variance, row means (14⅔ and 12⅔) and
column means (16, 13, and 12) are computed, but they are only aux-
iliary quantities, not estimates of factor impacts. However, the
differences in these two sets of means are equal respectively to the
differences in the impact [(2 and 0) and (15, 12, and 11)] of the two
variables into which S' is factorable [see equation (11-10)].

It is not my intention to go into the details of variance analysis. 
Just three comments about it: 

1. The reason why people analyze variance and not the fourth or 
seventeenth moment of the sample is this: A normal distribution with 
zero mean (such as the error term $v_{fit}$) can be completely described by
its variance. The variance is a sufficient estimate, for it contains all 
the information that is implicit in the assumed distribution. 

2. Under orthogonality, row, column, and unexplained variances 
add up to total variance, just as the square on the hypotenuse equals 
the sum of the squares on the other sides of a right-angled (orthogonal) triangle.

3. Under normality and orthogonality, variance ratios have certain 
convenient distributions, which are suitable for testing the null 
hypothesis (that rows or columns differ only by chance). 

Further readings 

Harold W. Watts, "Long-run Income Expectations and Consumer Saving," 
in Studies in Household Economic Behavior, by Dernburg, Rosett, and Watts 
(Yale Studies in Economics, vol. 9, pp. 103-144, New Haven, Conn., 1958), 
makes judicious use of dummy variables. 

Robert M. Solow, "Technical Change and the Aggregate Production
Function" (Review of Economics and Statistics, vol. 39, no. 3, pp. 312-320,
August, 1957), computes the unspecified variable "technology" not, as we 
have done in Sec. 11.2, by interregional aggregation, but by using the marginal 
productivity theory of distribution. 

Variance analysis is a vast subject. See Kendall, vol. 2, chaps. 23 and 24. 


Time series 

12.1. Introduction 

A time series x(t) = [x(1), . . . , x(T)] is a collection of readings,
belonging to different time periods, of some price, quantity, or other 
economic variable. We shall confine ourselves to discrete, consecutive, 
and equidistant time points. 

Like all the kinds of manifestations with which econometrics deals, 
economic time series, both singly and in combination, are generated 
by the systematic and stochastic logic of the economy. The same 
techniques of estimation, hypothesis searching, hypothesis testing, and 
forecasting that work elsewhere in econometrics work also in time series.

Why then a chapter on time series? Why indeed, were it not for the 
large amount of muddle and confusion we have inherited from many 
decades of well-intentioned but faulty investigations. 

The earliest and most abused time series are charts of the business
cycle and security market behavior. Desiring knowledge, business 
cycle "physiologists" avoided all models, assumptions, and hypotheses 
in the hope that the facts would speak for themselves. Pursuing 
profit, stock market forecasters have sought and are seeking (and their 
clients are buying) short cuts to strategic extrapolations; they have 



cared nothing about the logic, whether of the economy or of their 
methods. Their Economistry is the crassest of alchemies. 

The key ideas of this chapter are these: Facts never speak for 
themselves. Every method of looking at them, every technique for 
analyzing them is an implicit econometric theory. To bring out the 
implicit assumptions for a critical look, we shall study averages, 
trends, indices, and other very common methods of manipulating data. 

I do not mean to condemn the traditional approaches altogether. 
Certainly, physiology and "mere" description can do no harm — for 
ultimately they are the sources of hypotheses. To look for quick, 
cheap, and simple short cuts to forecasting is a reasonable research 
endeavor. Furthermore, modern machines can help by doing much of 
the dull work, provided that an intelligent being is available to study 
their output. 

12.2. The time interval 

Up to now I have carefully avoided any discussion of time. In the 

model of Chap. 1 

$C_t = \alpha + \gamma Z_t + u_t$     (12-1)

what does t = 1, 2, . . . , T represent, and why not select different time intervals?

The secret is that the time interval t, the parameters α and γ, the
variables C and Z, and the stochastic term u must be defined not 
without thought of but with regard to one another. If the time 
interval is short, then γ must be the short-run marginal propensity to
consume. If t is a year, then it makes sense for Z to be treated as 
predetermined. As the time interval is shortened, more and more 
variables change from predetermined to simultaneously determined. 
With shorter and shorter time periods, the causes that generate the 
random terms overlap more and more and invalidate the assumption of 
serially independent random disturbances. 

In certain cases we deliberately reduce the number of time intervals
of our data in order to bring time into agreement with the parameters 
and stochastic assumptions. For example, if we are trying to estimate 
a production or cost function and have hourly data for inputs and 
outputs, we may lump these into whole working days; otherwise the 
disturbances during the morning warm-up period, coffee break, lunch 


time, and the various peak fatigue intervals are not drawn from the
same Urn of Nature. 

The smoothing of time series must be done with care. In the above 
example, if the purpose is to make the random disturbance come from 
the same Urn in each interval, then overlapping as well as nonover- 
lapping workdays will do. If we also want the disturbances to be 
serially independent, then only nonoverlapping days should be used. 

Digression on moving averages and sums

Moving averages differ from moving sums only by a constant 
factor P equal to the number of original intervals smoothed.

If P is even (= 2N), the average or sum should be centered
on the boundary between intervals N and N + 1. If 2N + 1
intervals are averaged, center on the (N + 1)st.
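The centering rules just stated can be sketched in Python (an illustrative sketch; re-centering an even span with a second two-term average is a common convention, assumed here rather than taken from the text):

```python
# Centered moving averages.  For an odd span P = 2N+1 the average falls
# on the (N+1)st original interval; for an even span P = 2N it falls on
# a boundary, so a second two-term average re-centers it on whole periods.
def moving_average(x, P):
    return [sum(x[i:i + P]) / P for i in range(len(x) - P + 1)]

def centered_even(x, P):
    first = moving_average(x, P)       # centered on interval boundaries
    return moving_average(first, 2)    # re-centered on original periods

x = [3, 1, 4, 1, 5, 9, 2, 6]
odd = moving_average(x, 3)             # span 3, already centered
even = centered_even(x, 4)             # span 4, re-centered
```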

There are many smoothing methods besides the unweighted 
moving average. We are free to decide on the span P of the 
moving average and on the weight to be given each position within 
the span. Given P successive points, we may wish to fit to 
them a least squares quadratic, logistic, or other curve. Every 
particular curve implies a particular set of weights, and con- 
versely. Fitting a polynomial of degree Q through P points can 
be approximated by taking the simple moving average of a simple 
moving average of a simple moving average . . . enough times 
and with suitable spans.

All this is straightforward and rather dull, unaccompanied 
by theoretical justification. What makes moving averages 
interesting is the claim that they can be used to determine and 
remove the trend of a time series. We shall see in Sec. 12.8 how 
dangerous a technique this is. As we shall see in Sec. 12.5, mov- 
ing averages give rise to broad oscillations where none exist in 
the original series. 

12.3. Treatment of serial correlation 

The term serial correlation, or autocorrelation, means the nonindependence
of the values $u_t$ and $u_{t-\theta}$ of the random terms. The term
autoregression applies to values $x_t$ and $x_{t-\theta}$ when cov $(x_t, x_{t-\theta}) \neq 0$.


In this section we consider briefly (1) the sources of serial correlation, 
(2) its detection, (3) the allowances and modifications, if any, that it 
occasions in our estimating techniques, (4) the consequences of not 
making these allowances and modifications. 

Random terms are serially correlated when the time interval t is
too short, when overlapping observations are used, and when the data 
from which we estimate were constructed by interpolation. Thus, 
if, in 

$C_t = \alpha + \gamma Z_t + u_t$

t measures months or weeks, then the random term has to absorb the 
effects of the months' being different in length, weather, and holidays, 
effects which are not random in the short period but which follow 
a cycle of 365 days. If, however, t is measured in years, then all these 
influences are equalized, one year with another, and u t loses some of 
its autocorrelation. Similarly, if successive sample points are dated 
"January to December/' "February to January," "March to Febru- 
ary," and so on, successive random terms are correlated at least 10/12 
(10 being the number of months common to successive samples). 

Frequently the raw materials of econometric estimation are con- 
structed partly by interpolation. For instance, there is a census in
1950 and in 1960. Annual sample surveys in 1951, 1952, . . . measure
births, deaths, and migrations; these data, cumulated from 1950,
should square with the census population figure of 1960. Since this
seldom happens, the discrepancy in the final published figures is
apportioned (in general, equally) among the several years of the decade.
The resulting annual figures for birth rate, etc., share equal portions 
of a certain error of measurement and are, therefore, correlated more 
than they otherwise would be. In a model that uses annual data on 
the birth rate and assumes that it is measured without error, it is the 
random term that absorbs the year-to-year correlation. 

We shall illustrate with the simple model (12-1). There are 
two ways to detect serial correlation. One is to maintain the null 
hypothesis that none exists: 

$\text{cov}(u_t, u_{t-\theta}) = 0$     (12-2)

estimate the model on this assumption, and then check whether 
$m_{\hat u_t \hat u_{t-\theta}}$ is near zero. The other way is to maintain that the random


disturbances do have a serial connection, such as 

$u_t = \zeta_1 u_{t-1} + \cdots + \zeta_\theta u_{t-\theta} + v_t$     (12-3)

(with $v_t$ random and nonautocorrelated), and estimate the ζ's to see
whether they are significantly different from zero.

The first method is arithmetically easier, though a less powerful test. 
This requires explanation. The likelihood function of our sample is 
the same as (2-4) : 

$L = (2\pi)^{-S/2} [\det(\sigma_{uu})]^{-1/2} \exp[-\tfrac{1}{2}u'(\sigma_{uu})^{-1}u]$

where u stands for the S successive random disturbances $(u_1, u_2, \ldots, u_S)$.
In maximizing L with respect to α and γ, we should get the greatest efficiency in
$\hat\alpha$ and $\hat\gamma$ if we took account of the fact that $\sigma_{uu}$ is no longer diagonal
when there is serial correlation among the disturbances. The null
hypothesis cov $(u_t, u_{t-\theta}) = 0$, though it does not bias $\hat\alpha$ or $\hat\gamma$ or make
them inconsistent, does nevertheless increase their sampling variances
and covariances. The $\hat u$'s are computed with the help of the inefficient
$\hat\alpha$ and $\hat\gamma$ and are themselves inefficient estimates of the true dis-
turbances. Therefore $m_{\hat u_t \hat u_{t-\theta}}$ is an inefficient (i.e., overspread)
estimator of cov $(u_t, u_{t-\theta})$ and provides a flabby test of serial correlation.
It does not reject the null hypothesis with so much confidence as a
more powerful test (i.e., one associated with a very pinched distribution).


Instead of testing by $m_{\hat u_t \hat u_{t-\theta}}$ it is recommended that we compute
the expression

$D(\theta) = \frac{\sum_t (\hat u_t - \hat u_{t-\theta})^2}{\sum_t \hat u_t^2}$

which happens to have convenient properties, which are of no concern
to the present discussion. It is easily seen that, if $m_{\hat u_t \hat u_{t-\theta}} = 0$, then
D(θ) = 2. Large departures from this value indicate that the null
hypothesis is untrue.
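A numerical sketch of such a statistic (for lag 1, the ratio of the sum of squared differences of successive residuals to the sum of squared residuals, i.e., the familiar Durbin-Watson form), computed here on simulated rather than estimated residuals:

```python
# D is near 2 for serially uncorrelated residuals and falls below 2
# when successive residuals are positively correlated.
import random

def d_statistic(u, theta=1):
    num = sum((u[t] - u[t - theta]) ** 2 for t in range(theta, len(u)))
    den = sum(v * v for v in u)
    return num / den

random.seed(0)
white = [random.gauss(0, 1) for _ in range(5000)]
d_white = d_statistic(white)            # close to 2

ar = [0.0]                              # positively autocorrelated series
for _ in range(5000):
    ar.append(0.8 * ar[-1] + random.gauss(0, 1))
d_ar = d_statistic(ar)                  # well below 2 (roughly 2(1 - 0.8))
```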

The second method for taking into account the serial correlation of 
the random disturbance is more efficient than the first, but biased. 
To see this, consider the special case 

$C_t = \alpha + \gamma Z_t + u_t$     (12-4)

$u_t = \zeta u_{t-1} + v_t$     (12-5)


As good simultaneous-approach proponents, we combine the two 
equations as follows: 

$C_t - \zeta C_{t-1} = \alpha(1 - \zeta) + \gamma(Z_t - \zeta Z_{t-1}) + v_t$     (12-6)

and maximize the joint likelihood of the random disturbances v with 
respect to the three parameters α, γ, ζ. Unfortunately, not only does
this lead to a high-order system of equations, but the maximum likelihood
estimates are biased. The reason for bias is the same as in
Chap. 3, namely, that (12-6) is a model of decay.

There is yet a third method, which is somewhat biased and somewhat
inefficient. First fit (12-4) by least squares, ignoring (12-5): this step
is inefficient. Then compute $\hat\zeta$ from (12-5) using the residuals
$\hat u$ of the previous step: this introduces the bias. Next construct
the new variables $c_t = C_t - \hat\zeta C_{t-1}$, $z_t = Z_t - \hat\zeta Z_{t-1}$ and fit by least
squares $c_t = a + \gamma z_t + w_t$ to get a new approximation to α and γ.
Repeat the cycle any number of times.
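The cycle of this third method can be sketched as follows (a minimal one-regressor version on simulated data; the function names and simulation settings are mine):

```python
# Iterated scheme: (1) least squares on C = a + g*Z, (2) zeta from the
# lagged residuals, (3) re-fit on quasi-differenced variables, repeat.
import random

def ols(y, x):
    n = len(y)
    xb, yb = sum(x) / n, sum(y) / n
    g = sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y)) / \
        sum((xi - xb) ** 2 for xi in x)
    return yb - g * xb, g                    # (intercept, slope)

def iterate(C, Z, rounds=5):
    a, g = ols(C, Z)
    zeta = 0.0
    for _ in range(rounds):
        u = [c - a - g * z for c, z in zip(C, Z)]
        zeta = sum(u[t] * u[t - 1] for t in range(1, len(u))) / \
               sum(v * v for v in u[:-1])
        c = [C[t] - zeta * C[t - 1] for t in range(1, len(C))]
        z = [Z[t] - zeta * Z[t - 1] for t in range(1, len(Z))]
        a1, g = ols(c, z)
        a = a1 / (1 - zeta)                  # intercept of (12-6) is a(1-zeta)
    return a, g, zeta

random.seed(1)
Z = [random.uniform(0, 10) for _ in range(2000)]
u, C = 0.0, []
for z in Z:                                  # true a=2, g=0.5, zeta=0.6
    u = 0.6 * u + random.gauss(0, 1)
    C.append(2.0 + 0.5 * z + u)
a_est, g_est, zeta_est = iterate(C, Z)
```

Whether this cycle converges in general is, as the text notes, an open question; here it settles quickly because the model is well behaved.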

When several equations have autocorrelated error terms, this biased 
second method always works in principle. The first and third methods 
are dangerous to use because we know practically nothing about how 
good $[S/(S-\theta)]\,m_{\hat u_t \hat u_{t-\theta}}/m_{\hat u \hat u}$ is as an estimator of the regression coefficient
of $u_t$ on $u_{t-\theta}$; nor do we know whether the cyclical procedure of the
third method converges. 

Matters get rapidly worse the more complicated the dependence of 
u t on its past values. 

12.4. Linear systems 

Most business cycle analysis proceeds on the assumption (sometimes 
explicitly stated, more often not) that an economic time series x(t) 
is made up of two or more additive components f(t), g(t), . . . called
the "trend," the "cycle," the "seasonal," and the "irregular." Trend,
cycle, and seasonal are supposed to be, in some relevant sense, rather 
stable functions of time; the irregular is not. We shall use the expres- 
sions "irregular," "random component," "error," and "disturbance" 
interchangeably. The word "additive" signifies, as usual, lack of 
interaction effects among the components. 1 

In analyzing time series, the problem is to allocate the observed 

1 See Sec. 1.11. 


fluctuations in x to its unknown additive components: 

$x(t) = f(t) + g(t) + h(t) + u_t$     (12-7)

and to find the shapes of f, g, and h: whether they are straight lines,
polynomial or trigonometric functions, or other complicated forms.

As stated, the problem is indeterminate. The facts will never 
tell us either how many additive terms the expression in (12-7) 
should have or what shapes are best. As usual, we must maintain a 
hypothesis — that the trend is, say, a straight line:

$f(t) = \alpha + \beta t$

that the cycle is some trigonometric function, e.g.,

$\gamma \sin(\delta + \epsilon t)$

and so forth, the problem being to estimate the Greek letters from data 
or to see how well a given formulation fits in comparison with some 
rival hypothesis. 

Trigonometric functions can be approximated by lagged expressions, 
such as 

$g(t) = \gamma_0 + \gamma_1 x(t-1) + \gamma_2 x(t-2) + \cdots + \gamma_Q x(t-Q) + u_t$     (12-8)

with appropriate coefficients. The term "linear" expresses the addi- 
tivity of the components of (12-7) or the linear approximation of 
(12-8) or both. In this section and in several more, we shall consider 
linear systems of a single variable x(t). Linearity in the second sense 
(above) is very handy, because in linear systems the number of lags 
in (12-8) and the values of the γ's determine whether g(t) oscillates,
explodes, or damps; the initial value g(0) determines only the amplitude 
of the fluctuations. In nonlinear systems amplitude and type are not 
separable in this way. 
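The separability claim can be checked with a one-lag sketch (illustrative; a single coefficient gamma stands in for the general lag structure):

```python
# In a linear lag system the coefficient fixes whether the path damps or
# explodes; the initial value g(0) scales only the amplitude.
def path(g0, gamma, steps=20):
    g = [g0]
    for _ in range(steps):
        g.append(gamma * g[-1])
    return g

damped = path(5.0, 0.5)      # |gamma| < 1: damps toward zero
exploded = path(5.0, 1.5)    # |gamma| > 1: explodes
scaled = path(10.0, 0.5)     # doubling g(0) doubles every value
```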

We shall devote Secs. 12.5 to 12.7 to a priori trendless systems; then,
in Sec. 12.8, we shall inquire how we know a system to be trendless 
and, if it has a trend, how this trend can be removed. 

12.5. Fluctuations in trendless time series 

A trendless or a detrended time series can be random, oscillating, or 
cyclical. It is random if it can be generated by independent drawings 


from a definable Urn of Nature. It is cyclical (or periodic) if it repeats 
itself perfectly every Ω time periods; it is oscillatory if it is neither
random nor periodic. 

A simple trigonometric function like $\sin(2\pi t/\Omega)$, or $\sin(2\pi t/\Omega_1) + b\sin(2\pi t/\Omega_2)$
with commensurate periods, is strictly cyclical. The combination of two or more trigo-
nometric functions with incommensurate¹ periods $\Omega_1, \Omega_2, \ldots$ [for
instance, $x(t) = \sin(2\pi t/\Omega_1) + \cos(2\pi t/\Omega_2)$] is not periodic but oscil-
latory. The periods $\Omega_1, \Omega_2, \ldots$ appear in (12-7) only in
the trigonometric terms sin, cos, tan, etc., and not as multiplicative
factors, exponents, etc.

With the exception of purely seasonal phenomena (which are 
periodic), economic time series are overwhelmingly of oscillating type. 
Oscillations arise from three sources: (1) the summation of non- 
stochastic time series with incommensurate periods, (2) moving 
averages of random series, and (3) autoregressive systems having a
stochastic component. 

We can briefly dispose of the last case first. If x(t) is an autoregressive process

$x(t) = \alpha_1 x(t-1) + \cdots + \alpha_H x(t-H) + u_t$     (12-9)

whose systematic part would damp if u were to be continually zero,
then x(t) can be expressed as a weighted moving average of the random
disturbances, and so the third case reduces to the second case above.²
The moving average of a random series, however, oscillates! This
proposition, the Slutsky proposition, shocks the intuition at first and, 
therefore, deserves some discussion. Let us take a time series so long 
that we do not have to worry about any shortage of material to be 
averaged by moving averages. Consider now a moving average 
spanning P of the original periods. To facilitate the exposition, let 
us take P amply large. Now the original series u(t), if it is random,
should itself be neither constant nor periodic. Because if it is constant, 
it is not random. And if it is periodic, a given value of u depends on 
the previous one; hence u(t) is not random. A truly random series is
neither full of runs and patterns nor entirely bereft of them. Just as a 
true die, once in a while, produces runs of sixes or aces, so a random 

¹ Two real numbers are incommensurate when their ratio is not a rational number.

2 See Kendall, vol. 2, pp. 406-407. 


time series occasionally exhibits a run. For the sake of illustration, 
suppose the run is 3 periods long and $u_{101} = u_{102} = u_{103} = 10$. Now
consider what happens to its moving average in the neighborhood of the 
run. Let the span be large relative to the run, say, P = 17. Then 
the moving average has a run (less pronounced and more tapered) 
19 periods long — that is to say, from the time that the right-hand end 
of the span includes $u_{101}$ to the time that its left-hand end includes $u_{103}$.
A moving average of a moving average of a random series oscillates 
even more. 
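The run arithmetic of this example can be verified in a few lines (simulated series; span and run as in the text):

```python
# Slutsky proposition sketch: a 3-period run in a random series is felt
# by a span-17 moving average over 17 + 3 - 1 = 19 successive averages.
import random

def moving_average(x, P):
    return [sum(x[i:i + P]) / P for i in range(len(x) - P + 1)]

random.seed(2)
u = [random.gauss(0, 1) for _ in range(300)]
u[100:103] = [10.0, 10.0, 10.0]        # the run u_101 = u_102 = u_103 = 10
m = moving_average(u, 17)              # span P = 17

# averages whose 17-term window touches positions 100..102
touched = [i for i in range(len(m)) if i <= 102 and i + 16 >= 100]
```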

These simple properties are vital for the statistical analysis of 
business cycles. 

In the first place, the economic system itself operates somewhat like 
a moving average of random shocks: consumers, businesses, govern- 
ments get buffeted around by random external and internal impulses, 
such as weather, a rush of orders, a rash of tax arrears; the economy 
takes most of these things in its stride; it does not adjust instantane- 
ously and completely to the shocks, but rather cushions and absorbs
them over considerably larger spans than their original duration. 
The Slutsky proposition accounts for business oscillations as the result 
of averaging random shocks. 

In the second place, even if the economic system itself does no 
averaging, statisticians do. The national income, price indexes, and 
other data in all the fact books are averages or cumulants of one sort or 
another, frequently over time. Such data would exhibit oscillations 
even if the economy itself did not. 

Finally, analysts who use the moving average technique (on other- 
wise flawless data from an economy that is innocent of averaging) 
either for detrending or for any other purpose may themselves introduce
oscillations into their charts and so generate a business cycle
where none exists. 

12.6. Correlograms and kindred charts

According to the Slutsky proposition, if we want to analyze a time 
series we shall be well advised to leave it unsmoothed and try some 
direct attack. 

It is natural to ask first whether a given trendless time series x(t) is
oscillating or periodic. In the nonstochastic case the question can be 


quickly settled by the unaided eye, detecting faithful repetition of a 
pattern, however complicated. In the stochastic cases the faithful 
repetition is obscured by the superimposed random effects and their
echoes, if any.

Define serial correlation of order θ as the quantity

$\rho(\theta) = \frac{\text{cov}(x_t, x_{t-\theta})}{\text{var } x_t}$

A correlogram is a chart with θ on the horizontal axis and ρ(θ) or its
estimate r(θ) on the vertical. A strictly periodic time series has a
periodic correlogram with always the same silhouette and the same
periodicity. If the former is damped, so is the latter. A moving
average of random terms has a damped (or damped oscillating) cor-
relogram of no fixed periodicity. A nonexplosive stochastic auto-
regressive system like (12-9) has a correlogram that is a damped wave
of constant periodicity.
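An empirical correlogram can be sketched as follows (a minimal version on a simulated one-lag autoregression; all settings are illustrative):

```python
# Estimate r(theta) = cov(x_t, x_{t-theta}) / var(x_t) and chart it
# against theta; for an AR(1) with coefficient 0.7 the correlogram
# damps roughly like 0.7 ** theta.
import random

def r(x, theta):
    n = len(x)
    xb = sum(x) / n
    num = sum((x[t] - xb) * (x[t - theta] - xb) for t in range(theta, n))
    den = sum((v - xb) ** 2 for v in x)
    return num / den

random.seed(3)
x = [0.0]
for _ in range(5000):
    x.append(0.7 * x[-1] + random.gauss(0, 1))
correlogram = [r(x, k) for k in range(1, 6)]
```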

Correlograms are not foolproof. They may or may not identify 
correctly the type of model to which a given time series belongs. For 
instance, if the random term in (12-9) is relatively large, the correlogram 
of x(t) will compromise between the strictly periodic silhouette of the 
exact autoregressive system $\alpha_1 x(t-1) + \cdots + \alpha_H x(t-H)$ and the
nonperiodic silhouette of the cumulated random terms $u_t + \alpha u_{t-1} +
\cdots + \alpha^{t-1} u_1$. In general, it will neither damp progressively nor
exhibit any fixed periodicity. This is very unfortunate, because, from 
a priori theory, we expect to meet such time series often in economics. 

Business cycle and stock market analysts are often interested in turn-
ing points in a series and in forces bringing about these turning points 
rather than in the amplitude of the fluctuations. This leads naturally 
to periodograms. To take an example from astronomy, imagine that 
the time series x(t) measures the angle of Mars and Jupiter with an
observer on earth. We know this series to be analyzable into four 
components: the revolutions of Earth, Mars, and Jupiter round the 
sun plus the minor factor of the earth's daily rotation. Periodograms 
are supposed to show, from evidence in the time series itself, the four 
relevant periods $\Omega_1 = 365.26$ days, $\Omega_2 = 687$ days, $\Omega_3 = 11.86$ years,
and $\Omega_4 = 24$ hours. This is a relatively easy matter if the series is
nonstochastic, if we know beforehand that only four basic periods are 
involved, or both. The composite series fluctuates and undergoes 
accelerations, decelerations, and reversals occasioned by the movements
of its four basic components. All this is captured by the quantities

$A = \sum_t x(t)\cos\frac{2\pi t}{\Omega} \qquad B = \sum_t x(t)\sin\frac{2\pi t}{\Omega}$

where Ω is an unknown period. The periodogram is a chart with Ω
on the horizontal and $S^2$ on the vertical axis. The value $S^2 = A^2 + B^2$
attains maxima when Ω takes on the values $\Omega_1$, $\Omega_2$, $\Omega_3$, $\Omega_4$. The tech-
nique works fairly well if x(t) is indeed composed of periodic (trigo-
nometric) terms and a random component. It works very badly
when x(t) is autoregressive, because the echoes of past random dis-
turbances are of the same order of magnitude as the smaller periodic
components of $x(t) = \alpha_1 x(t-1) + \cdots + \alpha_H x(t-H)$ and claim the
same attention as the latter in the formula for $S^2$. Like the cor-
relogram, the periodogram fails us where it is most needed, that is,
in the analysis of an economic time series which we know to be auto-
regressive and stochastic though we know nothing about the number
and size of its Ω's.
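The periodogram ordinate can be sketched numerically (a minimal illustration on a pure sine of period 12; the trial periods are my choice):

```python
# S^2 = A^2 + B^2 with A = sum x(t) cos(2*pi*t/Omega) and
# B = sum x(t) sin(2*pi*t/Omega); S^2 peaks when the trial period
# Omega matches a true period of the series.
import math

def s_squared(x, omega):
    A = sum(v * math.cos(2 * math.pi * t / omega) for t, v in enumerate(x))
    B = sum(v * math.sin(2 * math.pi * t / omega) for t, v in enumerate(x))
    return A * A + B * B

x = [math.sin(2 * math.pi * t / 12) for t in range(240)]  # true period 12
peaks = {om: s_squared(x, om) for om in (8, 10, 12, 15, 20)}
```

With a nonstochastic series the true period stands out sharply; the text's warning is that autoregressive echoes blur exactly this contrast.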

12.7. Seasonal variation 

The easiest periodic components to measure and allow for are those 
tied to astronomy. We know that the cycle of rain and shine repeats 
itself every 365 days, and we would naturally expect this to be reflected 
in any time series having to do with swim suits, umbrellas, or number 
of eggs laid by the average hen. The same is true of cycles imposed by 
custom or by the state, for instance, the seven-day recurrence of 
Sunday idleness, the Christmas rush, the preference of employees for 
July holidays. In all these cases the period itself is known, although 
it may be complicated by moving feasts, the varying number of days 
in a month, and the occasional occurrence of, say, a short month 
containing four Sundays plus Easter or a Friday the thirteenth. The 
problem here is not to find the seasonal period but its profile. 

It is one thing to recognize and measure the seasonal profile and 
another to remove it. Sometimes we want to do the former, some- 
times the latter, depending on our purpose. 


If the purpose is to forecast cycles and trends, it is a false axiom that 
a seasonally adjusted series is a better series. The only time we are 
justified in taking out seasonal fluctuations is when we believe that 
businessmen know there is seasonality, expect it, and adjust to it in 
a routine way, either consciously in a microeconomic way or in their 
totality when many millions of their microeconomic decisions interact 
to form the business climate. So, for forecasting purposes, it is 
legitimate to wash out seasonal movements only when they are washed 
out of the calculations of consumers and businessmen. If a seasonal 
exists but people have not detected it, it should be left in. For 
instance, if it were true that the stock market had seasonal properties 
unknown to its traders, they should not be corrected for, because the 
participants mistake these for basic trends and react accordingly. 
Conversely if the relevant people think there is a seasonal when in 
fact none exists, its imagined effect should be allowed for by the fore- 
caster of trends. Suppose, as an example, that the market believes 
that the U.S. dollar falls in the summer relative to the Canadian and 
rises in the winter. This imagined seasonal should be taken into
account in analyzing the significance of monthly or quarterly import 
orders. To deseasonalize every time series may increase knowledge in 
all cases, but it increases forecasting accuracy only when the time has 
come when the market has learned all the real seasonals and imagines 
none where none exist. 

Every formula either for measuring seasonals or for removing them 
is an implicit economic theory, which may be appropriate for one 
economic time series and inappropriate for another. For instance,
treating the seasonal as an additive factor implies that a given absolute
deviation from some normal or trend is equally important in all
months. This is false in the case of, say, housing-construction starts
in Labrador; the average number of these is, let us assume, 4 in
December and 50 in July. Then 5 starts in December is a more
serious departure than 51 in July. However, many analysts use
additive seasonals for each and every time series.
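The Labrador comparison is simple arithmetic, and a short check (using the text's assumed figures of 4 and 50 starts) makes the point: the additive deviation is the same +1 in both months, while the percentage deviation is more than ten times larger in December.

```python
# Assumed Labrador housing starts from the text: seasonal normals and observations.
normals = {"December": 4, "July": 50}
observed = {"December": 5, "July": 51}

for month in normals:
    additive = observed[month] - normals[month]        # +1 in both months
    relative = observed[month] / normals[month] - 1    # 25% vs. 2%
    print(month, additive, round(100 * relative, 1))
```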

If the genuine seasonal period is 12 months, its profile can be approxi- 
mated by averaging the scores of several Januaries, then several 
Februaries, etc. This technique gives a biased estimate of the seasonal 
profile if the time series is autoregressive, unless random disturbances 
12 months apart are independent. To see this, take (for simplicity
only) the one-lag autoregressive model

x(t) = αx(t - 1) + γ sin (2πt/12) + u_t

and let 0 represent the first January and 12 the following one. For
simplicity, let us average the values x(0) and x(12) of just two
Januaries. Then we have

x(12) = α^12 x(0) + α^11 u_1 + α^10 u_2 + · · · + α u_11 + u_12 + γ sin 2π

which involves a moving sum of random terms, and this sum oscillates, 
as we already know from Sec. 12.5. The oscillation due to the random 
term will be confounded with the amplitude of the true seasonal. 
This will manifest itself in two ways: either the seasonal will seem to 
shift or, if it does not shift, it will contain the cyclical properties of the 
cumulated random effects. 
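The bias can be exhibited by simulation. The sketch below (Python; the model matches the one above, but the numbers are illustrative) averages all observations for each month: with no autoregression and no noise the true profile is recovered exactly, while with α near 1 the cumulated shocks wobble the estimated profile, just as the argument predicts.

```python
import math, random

def simulate(alpha, gamma, sigma, n_years, seed=0):
    """x(t) = alpha*x(t-1) + gamma*sin(2*pi*t/12) + u(t), with u normal(0, sigma)."""
    rng = random.Random(seed)
    x, series = 0.0, []
    for t in range(12 * n_years):
        x = alpha * x + gamma * math.sin(2 * math.pi * t / 12) + rng.gauss(0.0, sigma)
        series.append(x)
    return series

def month_profile(series):
    """Estimate the seasonal profile: average all Januaries, all Februaries, ..."""
    return [sum(series[m::12]) / len(series[m::12]) for m in range(12)]

# With no autoregression and no noise, averaging recovers the seasonal exactly;
clean = month_profile(simulate(alpha=0.0, gamma=1.0, sigma=0.0, n_years=5))
# with alpha near 1, the cumulated shocks distort the estimated profile.
wobbly = month_profile(simulate(alpha=0.9, gamma=1.0, sigma=1.0, n_years=5))
```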

12.8. Removing the trend 

Ultimately, economic theory and not the facts tells us whether the
trend (or longest-term movement) is linear or otherwise. If we obtain 
the trend as what is left after cycles and seasonals have been taken out, 
the trend inherits all the diseases and pitfalls of the seasonals. 

In particular, if we use a moving average to obtain the trend, we are 
almost certain to get it wrong. To see this, suppose that we have a 
trendless cyclical and stochastic phenomenon, say 

x_t = sin (2πt/θ) + u_t

depicted in Fig. 24. If the span P of the moving average is longer 
than the true period θ, then the moving average (dashes in Fig. 24)
exaggerates the oscillations and imposes a long wavy trend where none 
existed. Or again, if the system is autoregressive and trendless, 

x(t) = α_1 x(t - 1) + · · · + α_H x(t - H) + u_t

the moving average of the random term contributes its oscillations to 
the systematic ones and, by the same process as that shown in Fig. 24, 
imposes a long, wavy trend. Naturally, distortions like these arise 



when x(t) truly contains some systematic trend. Moving averages 
distort both the trend and the cycles. 
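The distortion of Fig. 24 can be reproduced numerically. In the sketch below (Python; the span and period are my own illustrative choices), a centered 17-term moving average is applied to a pure 12-period sine: the output is itself a wave, shrunken and reversed in phase, so the averaged series misrepresents the cycle entirely.

```python
import math

period, span = 12, 17          # moving-average span longer than the true period
n = 240
x = [math.sin(2 * math.pi * t / period) for t in range(n)]

half = span // 2
ma = [sum(x[t - half: t + half + 1]) / span for t in range(half, n - half)]
aligned = x[half: n - half]

# Regress the moving average on the original cycle to measure its effect:
gain = sum(m * a for m, a in zip(ma, aligned)) / sum(a * a for a in aligned)
print(round(gain, 3))          # about -0.22: the cycle is shrunken and phase-reversed
```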

The variate difference method eliminates trends on the ground that
any trend can be approximated by a polynomial of some degree N
and that such a polynomial can be brought down to zero after N + 1
differencings. Therefore, let

x(t) = γ_0 + γ_1 t + · · · + γ_N t^N + f(t) + u_t     (12-10)

where f(t) and u t are the cyclical and random factors. The method 

Fig. 24 

proceeds as follows: 

1. Difference (12-10) once:

x(t - 1) = γ_0 + γ_1 (t - 1) + · · · + γ_N (t - 1)^N + f(t - 1) + u_{t-1}     (12-11)

2. Subtract (12-11) from (12-10) and call y(t) the new variable
x(t) - x(t - 1). We do not need to write out y(t) in full but note
only that its trend is a polynomial of one degree less than the
polynomial in (12-10) and that its random component is

v_t = u_t - u_{t-1}     (12-12)

3. Do the same for y(t), and define z(t) = y(t) - y(t - 1); this too
reduces the power of the trend and generates a random component

w_t = v_t - v_{t-1} = u_t - 2u_{t-1} + u_{t-2}     (12-13)


4. Continue in this fashion as long as the estimated variances
m_xx, m_yy/2, m_zz/6 decrease. (The correcting denominators are
discussed below.)

To see what is going on, consider the first quadrant of Fig. 25, where



x(t) was taken to be a second-degree polynomial of t. Then y(t) is a
sloping straight line, and z(t) is a level one. The variance of x(t) is
quite high, because x assumes many widely different values as t changes.
The variance of y(t) is smaller, because, though y varies, it varies more
smoothly than x. And z does not vary at all. The variate difference
method reduces the trend to z, and any remaining variation in the
resulting series must be due to nontrend components.
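The first quadrant of Fig. 25 can be imitated in a few lines (Python; the particular quadratic is illustrative): differencing a second-degree trend once gives a straight line, twice a level one, and the variance falls to zero.

```python
import statistics

def diff(series):
    """One differencing step: y(t) = x(t) - x(t - 1)."""
    return [b - a for a, b in zip(series, series[1:])]

x = [t * t for t in range(20)]   # a second-degree polynomial trend, as in Fig. 25
y = diff(x)                      # a sloping straight line: 1, 3, 5, ...
z = diff(y)                      # a level line: 2, 2, 2, ...

print(statistics.pvariance(x), statistics.pvariance(y), statistics.pvariance(z))
```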

Several things are wrong with this method. First, if x extends to 
the second quadrant of Fig. 25, say, symmetrically, its covariation 
with its lagged values may be very small or even zero. And, in general, 

Fig. 25 

a high-degree polynomial, because it twists and turns up and down, 
may exhibit a smaller lag covariance than a low-degree polynomial. 
Hence we should faithfully carry on successive differencing in spite of 
a drop in the series m_xx, m_yy, m_zz. But suppose we do. How are we to
tell when the polynomial of unknown degree has finally died down? 
For meanwhile, as (12-12) and (12-13) show, we are performing 
moving averages of the cyclical component and, for all we know, this 
component may increase or decrease. Finally, the variate difference 
method cannot come to any stop if its cyclical component has a short 
lag. For instance, the first differences of 1, -1, 1, -1, . . . are
2, -2, 2, -2, . . . , and the first differences of the latter are 4, -4,
4, -4, and so on.
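The failure to settle down is easy to verify (Python sketch; the alternating series is the one in the text):

```python
def diff(series):
    return [b - a for a, b in zip(series, series[1:])]

x = [(-1) ** t for t in range(8)]   # 1, -1, 1, -1, ...
d1 = diff(x)                         # magnitudes of 2
d2 = diff(d1)                        # magnitudes of 4, and so on without end
print(d1, d2)
```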

Now a word about the correcting denominators. If u_t itself is serially
uncorrelated, then, from (12-12), the variance of v_t is twice that of u_t:

var v_t = cov (u_t,u_t) - 2 cov (u_t,u_{t-1}) + cov (u_{t-1},u_{t-1})

        = cov (u_t,u_t) + cov (u_{t-1},u_{t-1}) = 2 var u_t

Similarly, from (12-13),

var w_t = var (u_t - 2u_{t-1} + u_{t-2})

        = var u_t + 4 var u_{t-1} + var u_{t-2} = 6 var u_t

and so on for higher-order differences. 
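The correcting denominators generalize: the k-th difference of a serially uncorrelated u_t has variance equal to the sum of the squared binomial coefficients times var u_t, which gives the sequence 2, 6, 20, . . . . A quick check (Python):

```python
import math

def diff_coeffs(k):
    """Coefficients of the k-th difference of u(t): binomials with alternating signs."""
    return [(-1) ** j * math.comb(k, j) for j in range(k + 1)]

# With serially uncorrelated u, the variance of the k-th difference equals the
# sum of squared coefficients times var u: hence the denominators 2, 6, 20, ...
denominators = [sum(c * c for c in diff_coeffs(k)) for k in (1, 2, 3)]
print(denominators)   # [2, 6, 20]
```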

In my opinion, all these methods for detecting or eliminating the 
trend have serious imperfections. The way out is, as usual, to specify 
the algebraic form of the trend, the number of cyclical components 
acting on it, to make stochastic assumptions, and to maximize the 
likelihood of the sample. The procedure is very laborious; it is 
generally biased, but efficient. I think it represents the best we can 
ever do, and I am condemning the other methods only if they are 
pretentiously paraded as scientific. I do admit them as approxima- 
tions to the ideal. 

12.9. How not to analyze time series

The National Bureau of Economic Research has attracted a great 
deal of attention with its large-scale compilation and analysis of 
business cycle data. The compilation is done with such care, tenacity, 
and love as to earn the gratitude of all users of statistics. The analysis, 
however, has often been questioned. It proceeds roughly as follows: 

1. Define a reference cycle for all economic activity. This is a 
conglomerate of the drift of several time series, accorded various 
degrees of importance. 

2. Remove seasonal variations from the given series, say, carload- 
ings or business failures. 

3. Divide the given series into bands corresponding to the reference
cycles.

4. Within each band express each January reading as a per cent of 
the average January in the band, and so on to December. 

5. In each of the resulting specific cycles recognize nine typical
positions or phases. The latter may be widely spaced, like an open
accordion, in a long specific cycle or tightly in a short one. The result 
is now considered to be the business cycle in carloadings, and constitutes 
the raw material for forecasting, for computing the irregular effects, 
and for checking whether the given series can be said to have its typical 
periodicity, amplitude, etc. There are variations of the procedure, 
some ad hoc. After what I said earlier in the chapter about the 
pitfalls of time series, I shall not make any further comment on the 
National Bureau's method. Recently, electronic computations have 
been programmed, mainly for removing the seasonal. 1 As they 
involve the use of several layers of moving averages, they are not 
altogether safe in the hands of an analyst ungrounded in mathematical 
statistics; since, however, the seasonal is the least likely to cause harm 
(after all, the period is correct), we may set this question aside. 

12.10. Several variables and time series 

In Secs. 12.1 to 12.9 we have considered variables that move in
time subject to shocks and to laws of motion unconnected with any 
other variables. It hardly needs stressing that endogenous economic 
variables are not of this kind, since all of them are generated jointly by 
the workings of the economic system. One wonders of what use is the 
analysis of individual time series despite the heavy apparatus of 
correlograms, periodograms, and variate differences.

Assuming that several economic variables hang together structurally, 
what kinds of time series do they manifest? Sections 12.11 and 12.12 
discuss this problem. If several economic variables are unconnected, 
how does a given combination of them behave? The answer to this
question (Sec. 12.13) provides a null hypothesis for judging the effec- 
tiveness of averages, sums, and a variety of business indicators, like 
the National Bureau of Economic Research "cyclical indicators" and 
"diffusion indexes" (Sec. 12.13). The converse problems are also of 
great importance to the progress of business cycle research, because 
consideration of individual time series may enable us to infer the nature
of the economic system without laboriously estimating each structural 
equation by the methods of Chaps. 1 to 9. 

1 See Julius Shiskin, Electronic Computers and Business Indicators (Occasional 
Paper 57, New York: National Bureau of Economic Research, 1957). 


12.11. Time series generated by structural models 

What kinds of time series are generated when the two variables x 
and y are structurally related? We shall take up this question first for 
nonstochastic relations and then for stochastic relations under various 
simplifying assumptions. All our models will be complete. 

If the model is completely nonlagged, like the usual skeleton business
cycle model

C = α + βY
Y = C + I     (12-14)

with investment taken as exogenous, then the time series for consumption
and income have the same shape as the series for investment, as
can be seen from the reduced form:

(1 - β)C = α + βI
(1 - β)Y = α + I

In this example the agreement is not only in the timing of turns but 
in the phase as well, because investment, consumption, and income are 
positively related. In a more extended model

C = α + βY
I = γ + δG     (12-15)
Y = C + I + G

where investment is endogenous, government expenditure is exogenous,
and investment is discouraged by the latter (δ < 0), all time series will
coincide on timing; but when G grows I falls, and C and Y will fall if δ
is less than -1 and will rise if it is greater.

If (12-14) and (12-15) are made stochastic, all endogenous variables 
absorb some of the random disturbances. The random disturbances 
apportion themselves, one year with another, according to a fixed 
pattern among the endogenous variables. For instance, if u is the 
random disturbance of the consumption function and v of the invest- 
ment function, the reduced form of (12-15):

(1 - β)C = (α + βγ) + (β + βδ)G + u + βv
(1 - β)I = (γ - βγ) + (δ - βδ)G + (1 - β)v     (12-16)
(1 - β)Y = (α + γ) + (1 + δ)G + u + v


shows that the fluctuations and irregular components are in step, 
though they may differ in their amplitudes. 

The variances of the three irregular components in (12-16) are
proportional respectively to σ_uu + 2βσ_uv + β²σ_vv, (1 - β)²σ_vv, and
σ_uu + 2σ_uv + σ_vv. Thus, if the two random terms are positively
correlated (σ_uv > 0), income wobbles more than consumption and
consumption more or less than investment, depending on the size of
the marginal propensity to consume.
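The reduced form (12-16) can be checked arithmetically. The sketch below (Python; all parameter values are made up) solves the structural equations of (12-15) directly and confirms that the reduced-form expressions reproduce the same C, I, and Y.

```python
# Made-up parameter values; u and v are the two random disturbances.
alpha, beta, gamma, delta = 10.0, 0.6, 5.0, -0.5
G, u, v = 20.0, 1.0, -0.5

# Solve the structural model C = alpha + beta*Y + u, I = gamma + delta*G + v,
# Y = C + I + G directly:
I = gamma + delta * G + v
Y = (alpha + u + I + G) / (1 - beta)
C = alpha + beta * Y + u
assert abs(Y - (C + I + G)) < 1e-9          # the accounting identity holds

# Reduced form (12-16), term by term:
C_rf = ((alpha + beta * gamma) + (beta + beta * delta) * G + u + beta * v) / (1 - beta)
I_rf = ((gamma - beta * gamma) + (delta - beta * delta) * G + (1 - beta) * v) / (1 - beta)
Y_rf = ((alpha + gamma) + (1 + delta) * G + u + v) / (1 - beta)
assert abs(C - C_rf) < 1e-9 and abs(I - I_rf) < 1e-9 and abs(Y - Y_rf) < 1e-9
```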

Let us now consider as a recursive model the market for fish. The 
men go to their boats with today's price in their minds, expecting it to 
prevail tomorrow, and work hard if the price is high. Thus tomorrow's 
supply depends on today's price plus weather (z). Should the price 
fall, the fishermen don't put the fish back into the sea; so at the end of 
the day all the fish is sold. Demand is ruled by current price only. 

d = α + βp + u
s = γ + δp_{t-1} + εz + v     (12-17)
s = d

The model can be solved for p as follows:

α + βp_t + u_t = γ + δp_{t-1} + εz_t + v_t

which shows that price tends to zigzag (β negative, δ positive), falling
with good weather and rising with bad, as we might expect. In
(12-17), unlike case (12-16), the irregular components of the price and
quantity time series are no longer constant multiples of each other, nor 
are they in step. This is so because randomly overeager demand 
{u > 0) affects not only today's price but, through its effect on the 
fishermen's efforts, contributes to a fall in tomorrow's price as well. 
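A short simulation displays the zigzag. In the sketch below (Python; the parameter values are invented, with β negative and δ positive as in the text), successive deviations of price from its equilibrium alternate in sign:

```python
# Invented parameter values: demand slope beta < 0, supply response delta > 0.
alpha, beta, gamma, delta, eps = 100.0, -2.0, 10.0, 1.0, 0.5
z = 4.0                                    # constant weather; u and v set to zero

def next_price(p_prev):
    # From alpha + beta*p(t) = gamma + delta*p(t-1) + eps*z:
    return (gamma - alpha + delta * p_prev + eps * z) / beta

p_star = (gamma - alpha + eps * z) / (beta - delta)   # equilibrium price
prices, p = [], 40.0
for _ in range(6):
    p = next_price(p)
    prices.append(p)

deviations = [q - p_star for q in prices]
print([round(d, 3) for d in deviations])   # signs alternate: the zigzag
```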

The connections among phase, amplitude, and irregularity in 
structurally related time series become very complicated as we increase 
the number of variables and as we admit more and more lags and cross 
lags. In any representative set of economic time series it would 
indeed be a marvel if closely similar patterns emerged, except between
such series as sales of left and of right shoes. And yet the marvel
seems to happen.


12.12. The over-all autoregression of the economy 

Regardless of which came first, chickens and eggs in the long run 
have similar time series, because there can be no chicken without a
previous combination of egg and chicken and there can be no egg 
without a previous chicken. Since the hatching capacity of a hen is 
fixed, say, 10 chicks per hen, and since the chicken-producing capacity 
of an egg is also fixed, say 1 to 1, the cycles in the egg population and 
in the hen population cannot possibly fail to exhibit a likeness — though, 
in particular short-run instances, random disturbances like Easter or a 
fox can grievously misshape now the one, now the other series. Orcutt 
has claimed 1 that something like this is true of the time series of the 
economy's endogenous variables. He states that the autoregressive 

x_{t+1} = 1.3x_t - 0.3x_{t-1} + u_{t+1}

fairly describes the body of variables used by Tinbergen in his pioneer- 
ing analysis of American business fluctuations. 2 Orcutt's result, if 
correct, would not exactly spell the end of structural estimation of 
econometric models, because the latter may be more efficient, less
biased, etc. However, if a correct autoregression were discovered, it 
would certainly short-circuit a good deal of current research. 
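Orcutt's equation is easy to examine. Its characteristic polynomial z² - 1.3z + 0.3 has roots 0.3 and 1.0, a unit root: the exact part of the process sits precisely on the border of stability. A sketch (Python; the simulated realization is illustrative):

```python
import math, random

# Characteristic equation of x(t+1) = 1.3 x(t) - 0.3 x(t-1): z^2 - 1.3 z + 0.3 = 0.
b, c = -1.3, 0.3
disc = math.sqrt(b * b - 4 * c)
roots = sorted(((-b - disc) / 2, (-b + disc) / 2))
print(roots)                  # approximately [0.3, 1.0]: a root on the unit circle

# A short realization of the Orcutt process under standard normal shocks:
rng = random.Random(1)
x = [0.0, 0.0]
for _ in range(48):
    x.append(1.3 * x[-1] - 0.3 * x[-2] + rng.gauss(0.0, 1.0))
```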

Orcutt's theorem holds only for systems whose exact part, by itself, 
is stable and nonexplosive. Orcutt also found that we can get better 
estimates of the over-all autoregression if we consider many time series 
simultaneously than if we consider them one at a time. This follows 
from the fact that Easter and foxes descend on eggs and hens inde- 
pendently, so that a grievous random dent in the egg population tends 
to be balanced by the relative regularity of the hen population. 

In the absence of random shocks, all the interdependent variables 
have the same periodicity but different timing, amplitudes, and levels 
about which they fluctuate. With random shocks, the periodicities 
are destroyed more or less depending on the severity of the shocks and 
their incidence on particular variables. The unaided eye can seldom 

1 Reference is in Further Readings at the end of the chapter.

2 Jan Tinbergen, Statistical Testing of Business-Cycle Theories (Geneva: League 
of Nations, 1939). 


recognize the true periodicity. A highly sophisticated technique can 
screen out the autoregressive structure by combining observations 
from all time series, but it is so difficult to compute that one might as 
well specify a model in the ordinary way. 


12.A Let g(t) and k(t) be the population of gnus and of kiwis.
Let β_i and δ_i be the age-specific birth and death rates for gnus and α_j
and γ_j for kiwis. Disregard the question of the sexes. Let ε and ζ
stand for input-output coefficients expressing the necessary number
of kiwis a gnu must eat to survive, and conversely. Construct a
model of this ecological system. Do something analogous for new 
cars and used cars. 

12. B In a Catholic region, say, Quebec, the greater the number 
of priests and nuns, other things being equal, the smaller the birth 
rate, because the clergy is celibate. But the more numerous the 
clergy, other things being equal, the higher the birth rate of the laity, 
because of much successful preaching against birth control. Construct 
an ecological model for such a population. 

12. C The more people, the more lice, because lice live on people. 
But the more lice, the more diseases and, hence, the fewer people. 
Construct the model, with suitable life spans for the average louse 
and human. 

12.D According to the beliefs of a primitive tribe, lice are good for 
one's health, because they can be observed only on healthy people. 
(Actually the lice depart from the sick person because they cannot 
stand his fever.) Construct this model and compare with Exer-
cise 12.C.

12.13. Leading indicators 

An economic indicator is a sensitive messenger or representative of 
other economic phenomena. We search for indicators in the same 
spirit in which pathology examines the tongue and measures the pulse: 
for quickness, cheapness, and to avoid cutting up the patient to find 
out what is wrong with him. 

A timing indicator is a time series that typically leads, lags, or 
coincides with the business cycle. Exactly what this means will 
occupy us later. We shall deal only with the leading indicators. 


From what was said in Sec. 12.12, it comes as no surprise that certain 
economic time series, like Residential Building Contracts Awarded, 
New Orders for Durable Goods, and Average Weekly Hours in Manu- 
facturing, should have a lead over Disposable Income, the Consumer 
Price Index, and so forth. The difficult questions are (1) how to 
insulate the cyclical components of each series from the trend, seasonal, 
and irregular; (2) how to tell whether leads in the sample period are 
genuine rather than the cumulation of random shocks; and (3) where 
phases are far apart, how to make sure that carloadings lead disposable 
income and not conversely, or that the Federal discount rate does lead 
and direct the money supply and not try belatedly to repair past 
mistakes. I am sure that, ultimately, one has to fall back on economic 
theory; one is forced to specify bits and pieces of any autoregressive 
econometric model, because no amount of mechanical screening of the 
time series themselves can answer the third question convincingly. 

In 30 years of research the National Bureau of Economic Research 
has isolated about a dozen fairly satisfactory leading indicators out of 
800-odd time series. 1 I think, however, that in nearly all cases, 
a priori considerations would have led to the selection of these leading 
series without the laborious wholesale analysis of hundreds and 
hundreds of time series. For instance, Average Hours Worked in 
Manufacturing is a good candidate for leading indicator of manu- 
facturing activity because we know from independent observation 
that it is easier for a business establishment to take care of a moderate 
increase in orders by overtime than by hiring new workers and easier to 
tide over a lull by putting its workers on short time than by laying some 
off at the risk of losing them. All the sensible leading indicators 
thrown up in the National Bureau's screening are obvious in a similar 
way. An oddity like the production of animal tallow, which is said 
to lead better than many other series, could not have been discovered 
by a priori reasoning, but neither is it used by any sane forecaster, for a 
good empirical fit is no substitute for a sound reason. 

Part of the findings of the National Bureau are, I think, tautological, 
because the timing indicators lead, lag, and coincide not with each
other individually, but with the reference cycle, which is an index of 

1 See Geoffrey H. Moore, Statistical Indicators of Cyclical Revivals and Recessions
(Occasional Paper 31, New York: National Bureau of Economic Research, 1950),
particularly chap. 7 and appendix B. 


"general business activity." The latter is a vague conglomerate of 
employment, production, price behavior, and monetary and stock 
market activity; therefore, it is no wonder at all that some series lead, 
some coincide with, and others lag behind it. The reference cycle is a 
useful summary, but we should not be misled into existential fallacies 
about it. 

12.14. The diffusion index 

A diffusion index is a number stating how many out of a given set of 
time series are expanding from month to month (or any other interval). 
Diffusion indexes can be constructed from any set of series whatsoever 
and according to a variety of formulas, of which I shall discuss just
three.
There are two reasons why one might want to construct a diffusion 
index. One is the belief that a business cycle starts in some corner of 
the economic system and propagates itself on the surrounding territory 
like a forest fire. This says in effect that the diffusion index is a cheap
short-cut autoregressive econometric model. The second reason is
that the particular formula used to construct the index captures in a 
handy way the logic of economic behavior. 

Three different formulas have been suggested for the diffusion index: 

Formula A Per cent of the series expanding 
Formula B Per cent of the series reaching turns 
Formula C Average number of months the series have been
expanding or contracting

Research by exhaustion argues that we ought to try all these 
formulas on all time series and choose the formula that gives prag- 
matically the best results. This can be done quite cheaply on the 
Univac. I think such a procedure will frustrate our search for good 
indicators, because each formula embodies a different theory of 
economic behavior, not universally suitable. 

Formula A is justified by the classical type of business cycle, where 
income, employment, prices, hours, inventories, production, and so on, 
and their components move up and down in rough agreement or with 
characteristic lags. Suppose, however, that the authorities control 
totals — employment, some price index, credit, or the balance of 


payments. The result is "rolling readjustment" rather than cycles.
Formula A has lost its relevance. In a world of rolling readjustment 
this formula will show an uneventful record and will not be able to 
indicate, much less predict, sectional crises hiding under a calm total. 

Formula B is justified if consumers and business are more sensitive to 
turns, however mild, than to accelerations, however violent. Invest- 
ment plans are likely to be of this kind. As long as there is expansion
in demand, any overexpansion will be made good eventually. If there
is contraction, however small, the mistake is more obvious, and panic 
may easily result. On the other hand, there are many areas in both 
the consumer and the business sectors where small turns are not taken 
seriously. Formula B, therefore, can be used to best advantage in 
studying certain investment series (like railways' orders of rolling 
stock) but is counterprescribed elsewhere. 

Formula C gives great emphasis to reversals. Take a component
that has been slowly expanding for some months, then turns down
briefly. Formula C registers (for this component) 1, 2, 3, etc., up to a
large positive number, then -1 (for the first month of contraction).
The more sustained the expansion, the more violently does the formula
register a halt or small reversal. This formula, then, is appropriate
where habit and momentum play an important part. Where could 
we possibly want to apply it? 

Hire-purchase may be related to disposable income in some way
that agrees with the logic of formula C. Suppose that small increases
in income go into down payments and time payments for more and
more gadgets; if so, a small fall in income would put a complete stop
to new hire-purchase, because the family would continue the contractual
time payments on the old gadgets and would not be likely to cut
into food, clothing, and recreation to buy new gadgets. This is a
theory of consumer behavior, and formula C is a convenient way to
express it short of an econometric equation.
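The three formulas can be made concrete in code. The sketch below (Python; the two toy series and all details of turn detection are my own illustrative choices, since the text leaves them open) may also help with the exercises below:

```python
def formula_a(series_set, t):
    """Per cent of the series expanding between t - 1 and t."""
    up = sum(1 for s in series_set if s[t] > s[t - 1])
    return 100.0 * up / len(series_set)

def formula_b(series_set, t):
    """Per cent of the series that reached a turn (local peak or trough) at t - 1."""
    turns = sum(
        1
        for s in series_set
        if s[t - 2] < s[t - 1] > s[t] or s[t - 2] > s[t - 1] < s[t]
    )
    return 100.0 * turns / len(series_set)

def formula_c_component(series, t):
    """Months the component has moved in its current direction (negative if down)."""
    step = 1 if series[t] > series[t - 1] else -1
    run = 0
    while t - run >= 1 and (series[t - run] > series[t - run - 1]) == (step == 1):
        run += 1
    return step * run

# Two toy series (my own illustrative numbers):
s1 = [10, 11, 12, 13, 12, 11]
s2 = [20, 19, 21, 22, 23, 24]

print(formula_a([s1, s2], 3))        # 100.0: both series expanding at t = 3
print(formula_b([s1, s2], 4))        # 50.0: s1 reached a peak at t = 3
print(formula_c_component(s1, 3))    # 3: three months of uninterrupted expansion
print(formula_c_component(s1, 4))    # -1: the first month of contraction
```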


12.E Construct diffusion indexes by each formula from the two 
time series below: 

Series 1 








Series 2 









and compare the cyclical behavior of the indexes with that of the 
sum of the two series. 

12.F In Exercise 12.E, series 2, replace the 99 by 101, and con- 
struct formula A. Must turning points in the sum be preceded by 
turning points in the index? 

12. G Construct an example to show that an index according to 
formula C can be completely insensitive to the sum of the component
series.

12.H Show by example the converse of Exercise 12.G, namely,
that swings in the diffusion index formula C need not herald turns
(or any change whatsoever) in the sum of the component series.

12.15. Abuse of long-term series 

One unfortunate by-product of time series analysis is that it requires
long time series with which to work, and several research organizations
have responded enthusiastically to the challenge. 

For example, I have heard urgings that we construct Canadian 
historical statistics for the purpose of sorting out timing indicators, on 
the ground that what took the National Bureau 30 years can now be 
done in 30 hours electronically. I think this kind of work quite futile, 
for a few moments' reflection will convince us of its negative results. 
The Canadian economy, compared with the American, is small and 
relatively unbalanced; therefore, Canadian historical statistics will 
have a very large irregular component, which will overwhelm the fine
structural relationships we want to uncover. The Canadian economy,
being open, responds to impulses from abroad; therefore, even if we 
had good domestic historical time series, our chances of finding among 
them good indicators are slim. We also know that the Canadian 
economy is "administered" (it has more governments per capita than 
we have and more industrial concentration); so the developments that
are foreshadowed by the indicators are likely to be anticipated by the 
big policy makers, with the result that predictions go foul. We know 
that Canada is and will be growing fast and that the past (on which all 
indicators rely) will not be a dependable guide.

My guess is that the earliest useful year for time series on bread 
baking is somewhere around 1920. For iron ore shipments it is 1947, 
the year when certain Great Lakes canals were deepened. However, 


for housing demand as a function of family formation, many decades 
or even centuries might prove to contain valid information. 

There are many good reasons why we might want to construct 
uniformly long historical statistics, but certainly the needs of cycle
forecasting are not among them.

12.16. Abuse of coverage 

An unfortunate by-product of diffusion index analysis is that it 
encourages the construction of complete sets of data when incomplete 
ones would be more satisfactory. This is so because the timing and 
irregular features of the diffusion index change with the number of 
scries included in it. 

Let us suppose that we want to forecast industrial production by 
means of average weekly hours worked; the series rationalized in
Sec. 12.13 is a possible leading indicator. If hours worked come broken
down by industry, we suspect we might do better if we use a diffusion 
index of the basic series rather than the over-all average. 

Now our first impulse is to look at the published series for Hours 
Worked and make sure that they give complete coverage by industry 
and by locality and that the series have no gaps in time. After all, 
we want to forecast for all industry and for the entire country. Yet it 
is unreasonable to desire full coverage. 

First, some industries employ labor as a fixed, not a variable, input. 
A generating station, if it is operated at all, is tended by a switchman 
24 hours a day, regardless of its output. Labor is uncorrelated with 
output. Here is a case where coverage does harm to our forecast, 
because it introduces two uncorrelated variables on each side of the 
scatter diagram, so to speak. 

Second, in the service industries, the physical measure of output is
labor input, because this is how the compilers of government statistics
measure the production of services. If we insist on coverage of the
services, we get trivial correlations, not good forecasts.

Third, during retooling it is possible to have long working hours and 
no industrial production. What should we do? Throw out entirely 
any industry that has retooling periods? Not at all. It is enough to 
suppress temporarily from consideration the data for this industry 
until the experts tell us that retooling and catching up on backlog are 



over. A deliberate time gap in the statistics improves them. This 
method, though it appears to be wasting information, actually uses 
more, for it includes the fact that there has been a retooling period. 
The fact that the diffusion index is dehomogenized is a flaw of a second
order of importance. 

In this way we select statistics of average hours worked to use in 
forecasting industrial production which are a statistician's nightmare: 
they have time gaps, they are unrepresentative, and they do not 
reconcile with national accounts Labor Income when they are multi- 
plied by an average of wage rates. 

Similarly, for statistics that are most useful in forecasting, it is 
not necessary that they be classifiable into grand schemes, such as 
the National Income, Moneyflows, or Input-Output Tables. The 
Canadians plan to start compiling data on Lines of Credit agreed 
upon by chartered banks and their customers but not yet credited to 
the customer's account. Such a series, I think, will prove a better 
predictor than the present one, Business Loans. Now, if we had 
information on Lines of Credit, it would not fit any existing global 
scheme and would not become any more useful if it did. For forecast- 
ing purposes, I see no excuse for creating a matrix of Inter-sector 
Contingent Liabilities or for constructing a Balance of Withdrawable 
Promises account. 

12.17. Disagreements between cross-section and time 
series estimates 

It is very puzzling to find that careful studies of the consumption 
function derived from time series give a significantly larger value for 
the marginal propensity to consume than equally competent studies of 
cross-section data. Three kinds of explanations are available: (1) 
algebraic and (2) statistical properties of the model explain the dis- 
crepancy; and (3) cross-section data and time series data measure 
different kinds of behavior. We shall concentrate on explanations 
1 and 2 in order to show that algebra and statistics alone account for 
much of the difference and that to this extent explanations of the third 
category are redundant. 

Cross-section data are figures of income, consumption, etc., by 
individual families in a given fixed time period. Time series are data 


about a given family's consumption and income through time or about 
national consumption and income through time. 

Algebraic differences 

The shape of the consumption function can breed differences. If the 
family consumption function is nonlinear, say, 

    c = α + βy + γy² + u                                        (12-18) 

then the consumption function connecting average income av y and 
average consumption av c or total income Y and total consumption C 
will look different from equation (12-18), even if all families have the 
same consumption function and if the distribution of income remains 
constant. To see this, take just two families, 

    c₁ = α + βy₁ + γy₁² + u₁ 
    c₂ = α + βy₂ + γy₂² + u₂ 

add together and divide by 2 to get 

    av c = α + β av y + 2γ(av y)² − γy₁y₂ + av u                (12-19) 

and, in general, with N individuals, 

    av c = α + β av y + Nγ(av y)² − (γ/N) Σ_{i≠j} yᵢyⱼ + av u   (12-20) 

One might argue that, when income distribution remains unchanged, 
the cross term Σyᵢyⱼ remains constant and is absorbed into the estimate 
of α. But this is false, because the cross term appears in (12-20) 
multiplied by γ, another unknown parameter, whose estimate is bound 
up with the estimates of α and β in the least squares (or other) formulas. 
The discrepancy between (12-20) and (12-19) affects the estimates of 
all three parameters α, β, and γ if no allowance is made for the extra 
terms of the average consumption function. The last two terms of 
(12-20) are equal to 

    γ (1/N) Σᵢ yᵢ² 

that is to say, γ times the raw moment m′_yy of the family incomes. It follows 
that, for time series and cross-section studies to give agreeing results, 
that, for time series and cross-section studies to give agreeing results, 


the average (or the total) consumption function must contain a term 
in m′_yy expressing inequality of income, even if this inequality should 
remain unchanged from year to year. Moreover, neither the sample 
variance of av y nor the Pareto index is suitable for the correction. 

If income distribution varies with time, to get the two approaches 
to agree our correction must be more elaborate, because the factor 

    Σ_{i≠j} yᵢyⱼ 

must be calculated anew for each year of the data. If we have no 
complete census of all families, a sample estimate of Σyᵢyⱼ will be better 
than nothing. If a census of families exists but we are in a hurry, 
again we can approximate Σyᵢyⱼ to any desired degree by taking the 
families in large income strata. 
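The aggregation identity behind (12-20) is easy to check numerically. The sketch below uses made-up family incomes and parameter values (not from the text) and verifies both that averaging (12-18) over families produces the m′_yy term and that this term splits into the N γ(av y)² and cross-product pieces of (12-20):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, beta, gamma = 2.0, 0.6, 0.01

# Hypothetical family incomes and disturbances (illustrative only).
y = rng.uniform(1.0, 10.0, size=50)
u = rng.normal(0.0, 0.1, size=50)
c = alpha + beta * y + gamma * y**2 + u        # equation (12-18), family by family

N = len(y)
av_y, av_c, av_u = y.mean(), c.mean(), u.mean()
m_yy_raw = (y**2).mean()                        # raw moment m'_yy of family incomes

# Averaging (12-18) over families:
#   av c = alpha + beta*(av y) + gamma*m'_yy + av u
lhs = av_c
rhs = alpha + beta * av_y + gamma * m_yy_raw + av_u
assert abs(lhs - rhs) < 1e-9

# The gamma*m'_yy term splits as in (12-20):
#   gamma*m'_yy = N*gamma*(av y)**2 - (gamma/N) * sum_{i != j} y_i y_j
cross = y.sum()**2 - (y**2).sum()               # sum over i != j of y_i y_j
assert abs(gamma * m_yy_raw
           - (N * gamma * av_y**2 - gamma * cross / N)) < 1e-9
```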

Statistical differences 

Let us assume that the consumption function of a family is linear 
and constant over time and that it involves another variable x, 
reflecting some circumstance of the family, like age. 

    c = α + βy + γx + u                                         (12-21) 

However, let the characteristic x, as time passes, have a constant 
distribution among the several families. For example, in a stationary 
population, the age distribution of the totality of families remains unchanged, 
although the age of any given family always increases. If we aggregate 
(12-21) we get 

    av c = α + β av y + γ av x + av u                           (12-22) 

but av x, being constant, is absorbed into the constant term when we 
estimate (12-22). Not so if we trace the history of one such family by 
estimating (12-21) from time series. 

In practice there is a further complication: the characteristic x is 
not independent of the family's income; thus, β̂ and γ̂ are shaky 
estimates in (12-21) because of multicollinearity. This is an additional 
reason why time series and cross sections disagree. 

Thus, we do not need to go so far afield as to postulate several kinds 
of consumption functions (long-term, short-term) to explain these 
discrepancies. If, after we have corrected for the algebraic and 


statistical sources of discrepancy, some further disagreement remains 
unexplained, that is the time for additional theories. 

Further readings 

Kendall, vol. 2, devotes two lucid chapters to the algebra and statistics 

of univariate time series. 

The proof that ignoring the serial correlation of the random term in a single 
equation leaves least squares estimates unbiased and consistent can be found 
in F. N. David and J. Neyman, "Extension of the Markoff Theorem on Least 
Squares" (Statistical Research Memoirs, vol. 2, pp. 105-116, December, 1938). 

How to treat serial correlation is discussed by D. Cochrane and G. H. 
Orcutt, "Application of Least Square Regression to Relationships Containing 
Auto-correlated Error Terms" (Journal of the American Statistical Association, 
vol. 44, no. 245, pp. 32-61, March, 1949). 

Eugen Slutsky, "The Summation of Random Causes as the Source of 
Cyclical Processes" (Econometrica, vol. 5, no. 2, pp. 105-146, April, 1937), 
is rightly famous for its contribution to theory and its interesting experi- 
mental examples with random series drawn from a Soviet government lottery. 

Correlogram and periodogram shapes are discussed in Kendall, vol. 2, 
chap. 30. 

The brief discussion of autocorrelation, with examples, in Beach, pp. 178- 
180, is simple and useful. 

The early article by Edwin B. Wilson, "The Periodogram of American 
Business Activity" (Quarterly Journal of Economics, vol. 48, no. 3, pp. 375-417, 
May, 1934), is both ambitious and sophisticated. 

Tjalling C. Koopmans, in his review, entitled "Measurement without 
Theory, " of Arthur F. Burns and Wesley C. Mitchell's Measuring Business 
Cycles (Review of Economic Statistics, vol. 29, no. 3, pp. 161-172, August, 1947), 
delivers a classic and definitive criticism of some investigators' avoidance of 
explicit assumptions. All would-be chartists should read it. Koopmans also 
gives, on p. 163, a summary account of the National Bureau method for 
isolating cycles. 

J. Wise, in "Regression Analysis of Relationships between Autocorrelated 
Time Series" (Journal of the Royal Statistical Society, ser. B, vol. 18, no. 2, 
pp. 240-256, 1956), shows that, in recursive systems of two or more equations, 
least squares is biased both when the random terms of the separate equations 
are interdependent and when the random term of either equation is serially 
correlated. 
The reference of Sec. 12.12 is G. H. Orcutt, "A Study of the Autoregressive 
Nature of the Time Series Used for Tinbergen's Model of the Economic 
System of the United States 1919-1932," with discussion (Journal of the 
Royal Statistical Society, ser. B, vol. 10, no. 1, pp. 1-53, 1948). Arthur J. 
Gartaganis, "Autoregression in the United States Economy, 1870-1929" 


(Econometrica, vol. 22, no. 2, pp. 228-243, April, 1954), uses much longer time 
series and concludes that the over-all autoregressive structure changed 
drastically around the year 1913. Gartaganis uses six lags. 

I have discussed the mathematical properties of the diffusion index in 
"Must the Diffusion Index Lead?" (American Statistician, vol. 11, no. 4, 
pp. 12-17, October, 1957). Geoffrey Moore's comments are on pp. 16-17. 

Trygve Haavelmo, "Family Expenditures and the Marginal Propensity 
to Consume" (Econometrica, vol. 15, no. 4, pp. 335-341, October, 1947), 
reprinted as Cowles Commission Paper 26, affords a good exercise in the 
decoding of compact econometric argument. Haavelmo deals with the 
discrepancies arising from different ways of measuring the consumption 
function. 

Layout of computations 

I recommend a standard layout, no matter how large or small the 
model or what estimating procedure one plans to use (least squares, 
maximum likelihood, limited information) or what simplifying assump- 
tions one has made. There are three general rules to follow: 

1. Scale to avoid large rounding errors and to detect other errors more 
easily. Scaling should be applied in two stages. 

a. Scale the variables. 

b. Scale the moments. 

2. Use check sums. 

3. Compute all the basic moments. This may seem redundant, but 
is actually very efficient if one wants 

a. To compute correlations. 

b. To experiment with alternative models. 

c. To get least squares first approximations. 

d. To select the best instrumental variables. 



The rules in detail 

Stage 1 

Scale the variables. Express all of them in units of measurement 
(say, cents, tens of dollars, thousands, billions, etc.) that reduce all the 
variables to comparable magnitudes. Scale the units so as to bring 
the variables (or most of them) into the range from 0 to 1. For example: 

    National income       x₁ = 0.475 trillion dollars 
    Hourly wage rate      x₂ = 0.182 tens of dollars 
    Population            x₃ = 0.165 billions 
    Price of platinum     x₄ = 0.945 hundreds of dollars per ounce 

This, rather than the range 1 to 10 or 10 to 100, is preferred, because 
we shall include an auxiliary variable identically equal to 1. Then all 
variables, regular and auxiliary, are of the same order of magnitude. 

Stage 2 

Arrange the raw observations as in Table A.l. Note that the 
endogenous variables, the y's, are followed by their check sum Y and 
that, in addition to all the exogenous variables z₁, z₂, . . . , z_{H−1}, we 
devote a column to the constant number 1, which is defined as the last 
exogenous variable z_H. These are then followed by the check sum Z 
of the exogenous variables including z_H = 1 and by a grand sum 
X = Y + Z. 

Stage 3 

The raw moment of variable p on variable q is defined as 

    m′_pq = Σₛ p(s) q(s) 

where the sum is over the sample, s = 1, . . . , S. A raw moment is not the same 
thing as the simple moment m_pq defined in the Digression of Sec. 1.2. 
The simple moment m_pq is also called the (augmented) moment from the 
mean of variable p on variable q. 

Compute the raw moments of all variables on all variables. This 
gives the symmetrical matrix m′ of moments, shown in Table A.2. 
In Table A.2 the symbol m′ is omitted, and only the subscripts appear; 
for instance, y_G y₁ stands for m′_{y_G y₁}. 



Table A.1 
Arrangement of raw observations 

    Period | Endogenous variables  | Check | Exogenous variables            | Check | Grand 
           | y₁   · · ·   y_G      | sum Y | z₁  · · ·  z_{H−1}   z_H = 1   | sum Z | sum X 
    -------+-----------------------+-------+--------------------------------+-------+------ 
      1    | y₁(1) · · ·  y_G(1)   | Y(1)  | z₁(1) · · · z_{H−1}(1)   1     | Z(1)  | X(1) 
      2    | y₁(2) · · ·  y_G(2)   | Y(2)  | z₁(2) · · · z_{H−1}(2)   1     | Z(2)  | X(2) 
      ·    |   ·                   |  ·    |   ·                            |  ·    |  · 
      S    | y₁(S) · · ·  y_G(S)   | Y(S)  | z₁(S) · · · z_{H−1}(S)   1     | Z(S)  | X(S) 


Table A.2 
Matrix m′ of raw moments (the symbol m′ is omitted; only the subscripts appear) 

         |  y₁     · · ·  y_G     |  Y    |  z₁     · · ·  z_{H−1}      1     |  Z    |  X 
    -----+------------------------+-------+-----------------------------------+-------+----- 
    y₁   |  y₁y₁   · · ·  y₁y_G   |  y₁Y  |  y₁z₁   · · ·  y₁z_{H−1}    y₁·1  |  y₁Z  |  y₁X 
    ·    |   ·                    |   ·   |   ·                          ·    |   ·   |   · 
    y_G  |  y_Gy₁  · · ·  y_Gy_G  |  y_GY |  y_Gz₁  · · ·  y_Gz_{H−1}   y_G·1 |  y_GZ |  y_GX 
    Y    |  Yy₁    · · ·  Yy_G    |  YY   |  Yz₁    · · ·  Yz_{H−1}     Y·1   |  YZ   |  YX 
    z₁   |  z₁y₁   · · ·  z₁y_G   |  z₁Y  |  z₁z₁   · · ·  z₁z_{H−1}    z₁·1  |  z₁Z  |  z₁X 
    ·    |   ·                    |   ·   |   ·                          ·    |   ·   |   · 
  → 1    |  1·y₁   · · ·  1·y_G   |  1·Y  |  1·z₁   · · ·  1·z_{H−1}    1·1   |  1·Z  |  1·X 
    Z    |  Zy₁    · · ·  Zy_G    |  ZY   |  Zz₁    · · ·  Zz_{H−1}     Z·1   |  ZZ   |  ZX 
    X    |  Xy₁    · · ·  Xy_G    |  XY   |  Xz₁    · · ·  Xz_{H−1}     X·1   |  XZ   |  XX 

The arrow marks the row (and, by symmetry, the column) of the constant 1, 
which is used in stage 4. 


Stage 4 

Compute the augmented moments from the mean of each variable 
(except z_H = 1) on each variable, e.g., 

    m_{x₁x₂} = S m′_{x₁x₂} − m′_{x₁·1} m′_{1·x₂} 

This is done very easily because m′_{x₁·1} and m′_{1·x₂} are always on the level 
indicated by the arrows and in the row and column corresponding to 
x₁ and x₂. 

This procedure gives a square symmetric matrix m of moments 
from the mean. The new matrix contains one row and one column 
less than the matrix m′. 

Stage 5 

Rule for check sums. In both m and m' any entry containing a 
capital Y (or Z) is equal to the sum of all entries in its row that contain 
lower-case y's (or z's). Any entry containing a capital X is equal to 
the sum of everything that precedes it in the row. 

All these things are true in the vertical direction, since the matrices 
m' and m are symmetric. 
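Stages 2 to 5 are mechanical enough to sketch in present-day matrix software. The data below are made up; the point is the column layout of Table A.1, the raw-moment matrix m′, the check-sum rule, and the stage-4 passage to moments from the mean:

```python
import numpy as np

rng = np.random.default_rng(1)
S, G, H = 20, 2, 3              # sample size, endogenous count, exogenous count

y = rng.normal(size=(S, G))     # endogenous variables y_1, ..., y_G
z = np.column_stack([rng.normal(size=(S, H - 1)), np.ones(S)])  # z_H identically 1
Y = y.sum(axis=1)               # check sum of the y's (Table A.1)
Z = z.sum(axis=1)               # check sum of the z's, including z_H = 1
X = Y + Z                       # grand check sum

# Columns in the order y_1..y_G, Y, z_1..z_H, Z, X.
T = np.column_stack([y, Y, z, Z, X])
m_raw = T.T @ T                 # matrix m' of raw moments, m'_pq = sum over sample of p*q

# Stage-5 check-sum rule: any entry containing Y equals the sum of the
# y entries in its row, and similarly for Z; X entries equal Y + Z entries.
assert np.allclose(m_raw[:, G], m_raw[:, :G].sum(axis=1))
assert np.allclose(m_raw[:, G + 1 + H], m_raw[:, G + 1:G + 1 + H].sum(axis=1))
assert np.allclose(m_raw[:, -1], m_raw[:, G] + m_raw[:, G + 1 + H])

# Stage 4: augmented moments from the mean, m_pq = S m'_pq - m'_p1 m'_1q,
# taken from the row and column of the constant 1 (the last z column).
one = G + 1 + H - 1
m = S * m_raw - np.outer(m_raw[:, one], m_raw[one, :])
# m_pq equals S^2 times the (biased) sample covariance of p and q.
assert abs(m[0, 1] - S**2 * np.cov(y[:, 0], y[:, 1], bias=True)[0, 1]) < 1e-6
```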

Stage 6 

Scale the moments. This step is not always possible. Scan the 
symmetric matrix m. If it contains any row (hence, column) of 
entries all (or nearly all) of which are very large or very small relative 
to the rest of the rows and columns, divide or multiply the entire 
offending row and column by an appropriate power of 10. The purpose 
is to make the matrix m contain entries as nearly equal as possible. 
When moments have comparable magnitudes, matrix operations on 
them are very accurate, rounding errors are small, and calculating 
errors can be readily detected. 

Keep accurate track of the variables that have been scaled up or 

down in stages 1 and 6 and of how many powers of 10 in each stage 

and altogether. 

Stage 7 

Coefficients of correlation. These can be computed very easily 
from m, but unfortunately the checks do not work in this case. So 


drop the check sums and consider only part of m. The sample 
correlation coefficient between, say, the variables y_G and z_{H−1} is 

    r_{y_G, z_{H−1}} = m_{y_G z_{H−1}} / √(m_{y_G y_G} m_{z_{H−1} z_{H−1}}) 

Coefficients of correlation are used informally to screen out the most 
promising models (see Chap. 10 on bunch maps). 

Matrix inversion 

This is a frequent operation in estimating medium and large systems. 
Details for computing 

    M⁻¹ and M⁻¹N 

are given in Klein, pp. 151ff. There are various clever devices for 
inverting a matrix and performing the operation M⁻¹N. Electronic 
computers have standard programs, and it is well to use them if they 
are available. If M is small in size and if both M⁻¹ and M⁻¹N are 
wanted, do the following: Write side by side the matrix M, the unit 
matrix of the same size, and then N. 


Then perform linear combinations on the rows of the entire new 
matrix [M I N] in such a way as to reduce M to a unit matrix. When 
you have finished, you will have obtained 

    [I][M⁻¹][M⁻¹N] 
For example, let 

    M = [2  4]        N = [30  50] 
        [1  6]            [ 1   3] 

We shall trace the evolution of [M][I][N] into [I][M⁻¹][M⁻¹N]. 

    (M I N)₀ = [2  4 | 1  0 | 30  50] 
               [1  6 | 0  1 |  1   3] 

Divide the first row by 2. 

    (M I N)₁ = [1  2 | ½  0 | 15  25] 
               [1  6 | 0  1 |  1   3] 

In this new matrix, subtract row 1 from row 2. 

    (M I N)₂ = [1  2 |  ½  0 |  15   25] 
               [0  4 | −½  1 | −14  −22] 

Divide the new second row by 2. 

    (M I N)₃ = [1  2 |  ½  0 |  15   25] 
               [0  2 | −¼  ½ |  −7  −11] 

Subtract the new second row from the first. 

    (M I N)₄ = [1  0 |  ¾ −½ |  22   36] 
               [0  2 | −¼  ½ |  −7  −11] 

Divide the new second row by 2. 

    (M I N)₅ = [1  0 |  ¾ −½ |  22    36] 
               [0  1 | −⅛  ¼ | −3½  −5½] 
                       [M⁻¹]     [M⁻¹N] 

One can compute a string of "quotients" M⁻¹N, M⁻¹P, M⁻¹Q, etc., 
by tacking on N, P, Q and performing the same linear manipulations. This 
technique works in principle for matrices of any size; with more than 
three or four rows, however, it consumes a lot of paper and time. 
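The same reduction is easy to mechanize. A minimal sketch of the [M | I | N] procedure, using M = [[2, 4], [1, 6]] and N = [[30, 50], [1, 3]] as in the example (no pivoting, which is fine here but not for ill-conditioned matrices):

```python
import numpy as np

M = np.array([[2.0, 4.0],
              [1.0, 6.0]])
N = np.array([[30.0, 50.0],
              [ 1.0,  3.0]])

aug = np.hstack([M, np.eye(2), N])   # the matrix [M | I | N]

# Reduce the M block to the identity by row operations (Gauss-Jordan).
for col in range(2):
    aug[col] /= aug[col, col]                     # make the pivot 1
    for row in range(2):
        if row != col:
            aug[row] -= aug[row, col] * aug[col]  # clear the rest of the column

M_inv, M_inv_N = aug[:, 2:4], aug[:, 4:]
assert np.allclose(M_inv, np.linalg.inv(M))
assert np.allclose(M_inv_N, [[22.0, 36.0], [-3.5, -5.5]])
```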


Stepwise least squares 

Estimating the parameters of 

    y = γ₀ + γ₁z₁ + γ₂z₂ + γ₃z₃ + · · · + γ_H z_H + u 

by desk calculator, according to Cramer's rule, or by matrix inversion 
is a formidable task when H is greater than 3. The stepwise procedure 
about to be explained may be slow, but it has three advantages over 
the other methods: 

1. It can be stopped validly at any stage. 

2. It possesses excellent control over rounding and computational 
errors. 
3. We do not have to commit ourselves, ahead of time and once and 
for all, on how many decimal places to carry in the course of computa- 
tions, but we may rather carry progressively more as the estimates are 
successively refined. 

I shall illustrate the method by the simple case 

    w = αx + βy + γz + u 

where (in violation of the usual conventions) w is endogenous, and 
x, y, z are exogenous. For the sake of illustration let us assume that, 
in the sample we happen to have drawn, the exogenous variables are 



"slightly" intercorrelated, so that m_xy, m_xz, m_xu, m_yz, m_yu, m_zu are 
small numbers, although, of course, m_xx = m_yy = m_zz = 1. 

Step 1 

On the basis of a priori information, arrange the exogenous variables 
from the most significant to the least significant, that is to say, accord- 
ing to the imagined size of the parameters α, β, γ, disregarding their 
signs: 

    |α| > |β| > |γ| 

Step 2 

To estimate a first approximation to α, compute α̂₁ = m_wx/m_xx as an 
approximation to the true value. Let α̂₁ = α + A₁. 


Step 3 

Form a new variable 

    v = w − α̂₁x 

and estimate a first approximation to β by computing 

    β̂₁ = m_vy / m_yy 

Step 4 

Form the new variable 

    s = v − β̂₁y 

and then compute the first approximation to γ: 

    γ̂₁ = m_sz / m_zz 

Step 5 

Form the new variable 

    w₁ = s − γ̂₁z 

and compute 

    Â₁ = −m_{w₁x} / m_xx 

The idea here is to estimate the error A₁ of our first approximation 
α̂₁. Compute 

    α̂₂ = α̂₁ − Â₁ 

as a second approximation to α. 

Step 6 

Now that a better estimate of a is available, there is no point in 
correcting the first approximations β̂₁, γ̂₁. We discard them and 
attempt to get new approximations β̂₂, γ̂₂, based on the better estimate 
α̂₂. We first use α̂₂ to define a new variable 

    v₁ = w − α̂₂x 

Note that in this step we adjust the original variable w (not w₁). 
Proceed now as in steps 3 to 5 : 

    β̂₂ = m_{v₁y} / m_yy 

    s₁ = v₁ − β̂₂y 

    γ̂₂ = m_{s₁z} / m_zz 

    w₂ = s₁ − γ̂₂z 

    Â₂ = −m_{w₂x} / m_xx 

    α̂₃ = α̂₂ − Â₂ 

and so on. 
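The rounds can be traced numerically. The sketch below uses made-up variables built from an orthonormal basis, so that every moment is exact, and sets u = 0 to isolate the algebra from sampling noise; the step-5 correction is applied in the equivalent form α̂₂ = α̂₁ + m_{w₁x}/m_xx:

```python
import numpy as np

# Three "slightly" intercorrelated exogenous variables with unit moments.
x = np.array([1.0, 0.0, 0.0])
y = np.array([0.1, np.sqrt(0.99), 0.0])
z = np.array([0.1, 0.1, np.sqrt(0.98)])
alpha, beta, gamma = 1.0, 0.5, 0.2      # |alpha| > |beta| > |gamma| (step 1)
w = alpha * x + beta * y + gamma * z    # u = 0, noiseless illustration

m = lambda a, b: a @ b                  # moments; here m_xx = m_yy = m_zz = 1

a1 = m(w, x) / m(x, x)                  # step 2: first approximation to alpha
v = w - a1 * x                          # step 3
b1 = m(v, y) / m(y, y)
s = v - b1 * y                          # step 4
g1 = m(s, z) / m(z, z)
w1 = s - g1 * z                         # step 5 residual, u - A1 x - B1 y - C1 z
a2 = a1 + m(w1, x) / m(x, x)            # corrected second approximation

# The confluence error shrinks from first order to second order in the
# small cross moments: |a1 - alpha| is about 0.07, |a2 - alpha| far smaller.
assert abs(a2 - alpha) < 0.01 < abs(a1 - alpha)
```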

The method of stepwise least squares does indeed yield better esti- 
mates in each round. To see this, consider the steps 

    α̂₁ = m_wx/m_xx = α + m_{(βy+γz+u)x}/m_xx = α + A₁ 

    v = w − α̂₁x = αx + βy + γz + u − αx − A₁x 
      = βy + γz + u − A₁x 

    β̂₁ = m_vy/m_yy = β + m_{(γz+u−A₁x)y}/m_yy = β + B₁ 

    s = v − β̂₁y = βy + γz + u − A₁x − βy − B₁y 
      = γz + u − A₁x − B₁y 

    γ̂₁ = m_sz/m_zz = γ + m_{(u−A₁x−B₁y)z}/m_zz = γ + C₁ 

    w₁ = s − γ̂₁z = γz + u − A₁x − B₁y − γz − C₁z 
       = u − A₁x − B₁y − C₁z 

    α̂₂ = α̂₁ − Â₁ = α + A₁ + m_{w₁x}/m_xx 
       = α + A₁ + m_{(u−A₁x−B₁y−C₁z)x}/m_xx 
       = α + A₂ 

The residual factor A₂ is of smaller order of magnitude than A₁, 
and so α̂₂ is better than α̂₁ as an estimate of α. To see this consider 

    A₂ = A₁ + m_{(u−A₁x−B₁y−C₁z)x}/m_xx 
       = m_ux/m_xx − B₁(m_yx/m_xx) − C₁(m_zx/m_xx) 

Expressing B₁ and C₁ in terms of A₁ and collecting terms, 

    A₂ = β[· · ·] + γ[· · ·] 
         + [m_ux/m_xx − (m_yx m_uy)/(m_xx m_yy) − (m_zx m_uz)/(m_xx m_zz) 
            + (m_zx m_yz m_uy)/(m_xx m_yy m_zz)] 

where the brackets multiplying β and γ contain only products of two or 
more of the small moments. Each bracketed term is of small order of magnitude. So, unless γ is 
numerically very large (which was guarded against in step 1), it 
follows that α̂₂ is an improvement over α̂₁. The same can be shown 
for β̂₂, γ̂₂, and α̂₃, compared with β̂₁, γ̂₁, and α̂₂, respectively. 

The method of stepwise least squares can also be used when x, y, and 
z are endogenous variables. In this case, although the bracketed 
terms are not negligible, they keep decreasing in successive rounds of 
the procedure. Had another variable been treated as the independent 
one, the stepwise method (like any procedure based on naive least 
squares) would, in general, have given another set of results. 


Subsample variances 
as estimators 

Consider the model y = α + γz + u, under all the Simplifying 
Assumptions. Let 

    γ̂ = m_yz / m_zz        (writing m_zz for z₁² + · · · + z_S²) 

be the maximum likelihood (and also the least squares) estimate of γ 
based on a sample of size S. Its variance is 

    cov (γ̂,γ̂|S) = e(γ̂ − eγ̂)² = e(γ̂ − γ)² 
                 = e[(u₁z₁ + · · · + u_S z_S)² / (z₁² + · · · + z_S²)²] 

Holding z₁, . . . , z_S fixed, this reduces to 

    σ_uu(z₁² + · · · + z_S²)/(z₁² + · · · + z_S²)² = σ_uu/(z₁² + · · · + z_S²) = σ_uu/m_zz 

Let us now ask what happens on the average if from the original 
sample we obtain its S subsamples (each of size S — 1), if we compute 



the corresponding parameters γ̂(1), γ̂(2), . . . , γ̂(S), 
and if we then compute the sample variance V of these γ̂'s. 

    SV = Σₛ [γ̂(s) − av γ̂(s)]² = Σₛ γ̂(s)² − (1/S)(Σₛ γ̂(s))² 

    e(SV) = S eV = Σₛ e γ̂(s)² − (1/S) e(Σₛ γ̂(s))² 

For our fixed constellation of values (z₁, . . . , z_S) of the exogenous 
variable, any terms of the form e(uᵢuⱼ) (i ≠ j) equal zero. By 
careful manipulation we obtain 

    S eV = σ_uu [ Σₛ 1/(m_zz − z_s²) − (1/S) Σₛ 1/(m_zz − z_s²) 
                  − (1/S) Σₛ Σ_{t≠s} (m_zz − z_s² − z_t²) / ((m_zz − z_s²)(m_zz − z_t²)) ] 

So far this is an exact result. If we make the further assumption that 
the values z₁, . . . , z_S of the exogenous variable are spread "typically," 
the relations 

    m_zz − z_s² ≈ ((S − 1)/S) m_zz 

    m_zz − z_s² − z_t² ≈ ((S − 2)/S) m_zz 

are approximately true, or at least become so in the course of the 
summations given in the last brackets. Therefore, 

    S eV ≈ (S/(S − 1)) σ_uu/m_zz        that is        e[(S − 1)V] = σ_uu/m_zz 

It follows that (S − 1)V is an unbiased estimator of cov (γ̂,γ̂|S). 
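The claim can be checked by simulation. In the sketch below (made-up parameter values; z is chosen with equal spread around its mean so that the "typical spread" approximation applies), many samples are drawn, the S subsample slopes and their variance V are computed, and the average of (S − 1)V is compared with the true sampling variance:

```python
import numpy as np

rng = np.random.default_rng(3)
S = 40
z = np.tile([0.0, 1.0], S // 2)         # fixed, "typically" spread exogenous values
alpha, gamma, sigma = 1.0, 2.0, 0.5
true_var = sigma**2 / np.sum((z - z.mean())**2)   # cov(g_hat, g_hat | S)

def slope(zz, yy):
    zc, yc = zz - zz.mean(), yy - yy.mean()
    return (zc @ yc) / (zc @ zc)        # least squares estimate of gamma

est = []
for _ in range(2000):
    y = alpha + gamma * z + rng.normal(0.0, sigma, S)
    g_sub = np.array([slope(np.delete(z, s), np.delete(y, s)) for s in range(S)])
    V = np.mean((g_sub - g_sub.mean())**2)      # sample variance of the S subsample slopes
    est.append((S - 1) * V)                     # the (S - 1)V of the text

ratio = np.mean(est) / true_var
assert 0.9 < ratio < 1.15               # approximately unbiased for this design
```

This is, in present-day terms, the jackknife variance estimator; the "typical spread" assumption keeps the leverage of each observation roughly equal, which is why the simple (S − 1)V correction works.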


Proof of least squares bias 
in models of decay 

Let the variable δₜ be equal to 1 if time period t is in the sample, and 
zero otherwise. The least squares estimate of γ is 

    γ̂ = Σₜ δₜ yₜ yₜ₋₁ / Σₜ δₜ y²ₜ₋₁ 

The proposition e(γ̂) < γ will be proved by mathematical induction. 
It will be shown true for an arbitrary sample of S = 2 points; then its 
truth for S + 1 will be shown to follow from its truth for any S. 

Definition. A conjugate set of samples contains all samples having 
the following properties : 

1. The samples include the same time periods and skip the same 
time periods (if any); let t₁ be the first period included and t_S the last. 

2. If time period t is included in the samples, then all samples of the 
conjugate set have disturbances uₜ of the same absolute value. The 
disturbances do not have to be constant from period to period. 

3. When a time period is skipped by the sample, algebraically equal 
disturbances must have operated on the model during the skipped 
periods. 

4. The samples have come from a universe having, as of t₁, the same 
values for all predetermined variables. 

Consider an arbitrary sample of two points. Let one point come 
from period j and the other from period k (k > j) ; the sample can be 
completely described by the disturbances operating at and between 
these two time periods; that is, 

Si = (+ty,ty+i, • • • ,w*_i,+WjO 

Si has three conjugates: 

S 2 = (+tt/,% + i, . . . ,w*_i,-w*) 
Sa = (— Uj,Uj + ii . . . ,u k -i,+u k ) 
S 4 = (— Uj'M+u . . . ,Wfc_i,-w*) 

Denote the four corresponding least squares estimates of y by the 
symbols t(++), 7(+~), 7(-+), and ?(--). 

By definition, each of the four conjugate samples has inherited from 
the past the same value y_{j−1} of lagged consumption. In period j the 
random disturbance uⱼ operates positively for samples S₁ and S₂ and 
negatively for S₃ and S₄. Therefore, in the next period S₁ and S₂ 
inherit one value p = γy_{j−1} + uⱼ for lagged consumption, and samples 
S₃ and S₄ inherit another value n = γy_{j−1} − uⱼ. By the definition of 
conjugates, in periods j + 1, j + 2, . . . , k − 1, equal random 
disturbances affect all samples S₁ to S₄. Moreover, in period k, 
samples S₁ and S₂ receive an equal inheritance of lagged consumption 
from the past. Call it y_p. Its exact value can be obtained by applying 
model (3-2) (see Sec. 3.2) to p successively enough times, but this value 
is of no interest. Samples S₃ and S₄ each get the inheritance y_n, which 
likewise arises from the application of model (3-2) to n. The two 
inheritances are different: y_p > y_n, since p > n. 

When we come to period k, samples S₁ and S₂ part company, because 
the first receives a boost +u_k, and the second receives the opposite 
−u_k. For the same reason S₃ parts company with S₄. Write 

    q = γy_p + u_k     r = γy_p − u_k     v = γy_n + u_k     w = γy_n − u_k 


The four conjugate estimates are 

    γ̂(++) = (p y_{j−1} + q y_p)/(y²_{j−1} + y_p²)     γ̂(+−) = (p y_{j−1} + r y_p)/(y²_{j−1} + y_p²) 
    γ̂(−+) = (n y_{j−1} + v y_n)/(y²_{j−1} + y_n²)     γ̂(−−) = (n y_{j−1} + w y_n)/(y²_{j−1} + y_n²) 

Symbolize the sum of these four estimates by Σ_{conj S₁} γ̂(± ±) or Σ γ̂. 

    Σ γ̂(± ±) = [2p y_{j−1} + (q + r) y_p]/(y²_{j−1} + y_p²) 
               + [2n y_{j−1} + (v + w) y_n]/(y²_{j−1} + y_n²) 
             = 4γ + residual 

The residual is always negative, because y_p² > y_n² if y_{j−1} > 0, and 
y_p² < y_n² if y_{j−1} < 0. Therefore the average γ̂ estimate from this 
conjugate set of samples is less than the true γ. 

Consider an arbitrary sample of size S + 1, which I call sample 
B(+). Let it contain observations from time periods j₁, j₂, . . . , j_S, 
j_{S+1} (which need not be consecutive). B(+) can be completely 
described by the disturbances that generated it plus the predetermined 
condition y_{j₁−1}. 

    B(+) = (y_{j₁−1}; u_{j₁}, u_{j₂}, . . . , u_{j_S}, +u_{j_{S+1}}) 

Now consider another sample A which contains one time period 
(the last one) less than B(+) but which is in all other respects the 
same as B(+): 

    A = (y_{j₁−1}; u_{j₁}, u_{j₂}, . . . , u_{j_S}) 

The conjugate set of A can be described briefly by 

    conj A = (y_{j₁−1}; ±u_{j₁}, ±u_{j₂}, . . . , ±u_{j_S}) 

The conjugate set of B(+) has twice as many samples as conj A; the 
elements of conj B(+) can be constructed from elements of conj A by 
adding another period in which the disturbance u_{j_{S+1}} shows up once 
with a plus sign and once with a minus sign. 
Define B(−) as the sample consistent with predetermined condition 
y_{j₁−1} and containing all the disturbances of sample B(+) identically, 
except the last, which takes the opposite sign. Therefore, if 

    B(+) = (y_{j₁−1}; u_{j₁}, u_{j₂}, . . . , u_{j_S}, +u_{j_{S+1}}) 

then 

    B(−) = (y_{j₁−1}; u_{j₁}, u_{j₂}, . . . , u_{j_S}, −u_{j_{S+1}}) 

Assume that the estimates γ̂ derived from conj A average less than 
the true γ (0 < γ < 1). Symbolize this statement as follows: 

    Σ_{conj A} γ̂(± ± · · · ±) < 2^S γ 

Each γ̂ in the above sum is a fraction of the form 

    γ̂ = Σₜ δₜ yₜ yₜ₋₁ / Σₜ δₜ y²ₜ₋₁                                (1) 

Let γ̂(A) stand for the estimate derived from sample A. Then γ̂(A) 
can be expressed as a quotient N/D of two specific sums N and D, 
where D is positive. Each sample from the set conj B gives rise to an 
estimate of γ. The formulas now are fractions like (1), but the sums 
have one more term in the numerator and one more in the denominator, 
because one more period is involved. Thus, writing y′ for the lagged 
value inherited in period j_{S+1}, 

    γ̂[B(+)] = (N + γy′² + y′u_{j_{S+1}}) / (D + y′²) 

    γ̂[B(−)] = (N + γy′² − y′u_{j_{S+1}}) / (D + y′²) 

It follows that 

    γ̂[B(+)] + γ̂[B(−)] = 2 (N + γy′²) / (D + y′²) 

If γ̂(A) > γ, then the last fraction is less than γ̂(A); if γ̂(A) = γ, 
then the fraction equals γ; if γ̂(A) < γ, then the fraction is less than γ. 
Exactly the same is true for all samples in the conjugate set of A. Hence 

    Σ_{conj B} γ̂ ≤ 2 Σ_{conj A} γ̂ < 2^{S+1} γ 

which completes the proof. 

I shall not discuss what happens if the value of γ does not lie between 
0 and 1, because no new principle or new difficulty arises. As an 
exercise, find the bias of γ̂ if γ lies between 0 and −1. 
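A small Monte Carlo experiment illustrates the proposition; the model, sample size, and parameter values below are made up for the purpose:

```python
import numpy as np

rng = np.random.default_rng(4)
gamma, T, R = 0.5, 10, 20000            # true parameter, sample length, replications
u = rng.normal(size=(R, T))
y = np.empty((R, T + 1))
y[:, 0] = 1.0                           # predetermined initial condition
for t in range(T):
    y[:, t + 1] = gamma * y[:, t] + u[:, t]     # the model of decay

num = (y[:, 1:] * y[:, :-1]).sum(axis=1)        # sum of y_t y_{t-1}
den = (y[:, :-1] ** 2).sum(axis=1)              # sum of y_{t-1}^2
g_hat = num / den                               # least squares estimate, per sample
assert g_hat.mean() < gamma - 0.02              # e(g_hat) < gamma: downward bias
```

With samples this short the downward bias is substantial (on the order of 2γ/T); it shrinks as the sample lengthens.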


Completeness and stochastic 
independence 

The proof that e(u_g u_p) = 0 implies det B ≠ 0 is by contradiction. 
If det B = 0, then a nontrivial linear combination of B's rows is 0 (the 
zero vector): 

    L = λ₁β₁ + · · · + λ_G β_G = 0                              (1) 

Hence, 

    λ₁β₁y + · · · + λ_G β_G y = 0 

But, by the model By + Γz = u, we also have β_g y = u_g − Γ_g z for every g. 

So  λ₁u₁ + · · · + λ_G u_G = (λ₁Γ₁ + · · · + λ_G Γ_G)z = Z      (2) 

Since Z is a constant number for any constellation of values of the 
exogenous variables, we have in equation (2) a nontrivial linear relation 
among the disturbances u₁, . . . , u_G. This contradicts the premise 
that they are independent. 
This argument shows that e(u_g u_p) = 0 only if det B ≠ 0. 



The asterisk notation 

A single star (*) means presence in a given equation of the variable 
starred; a double star (**) means absence. 
Accordingly, in the model 

    y₁ + γ₁₁z₁ + γ₁₂z₂ + γ₁₃z₃ + γ₁₄z₄ = u₁                    (1) 

    β₂₁y₁ + y₂ + β₂₃y₃ + γ₂₁z₁ + γ₂₂z₂ + γ₂₃z₃ = u₂            (2) 

    β₃₁y₁ + y₃ + γ₃₁z₁ + γ₃₂z₂ = u₃                            (3) 

with reference to the third equation, 

y* means the vector of the endogenous variables present in the 
third equation, namely, vec (y₁,y₃) 
    y** = vec (y₂) 
    z* = vec (z₁,z₂) 
    z** = vec (z₃,z₄) 

For the first equation, 

    y* = vec (y₁) 
    y** = vec (y₂,y₃) 

    z* = vec (z₁,z₂,z₃,z₄) 

Stars (single or double) may also be placed on the symbol x, which 



stands for all variables, endogenous or exogenous. Thus for the 

second equation, 

    x* = vec (y₁,y₂,y₃,z₁,z₂,z₃) 
    x** = vec (z₄) 

G*_g is the number of y's present in the gth equation; G**_g is the number 
absent. H*_g is the number of z's present; H**_g is the number absent. 

    G*₁ = 1     H*₁ = 4     G*₂ = H*₂ = 3     G**₃ = 1     H**₃ = 2 

α*_g, β*_g, γ*_g are vectors made up of the nonzero parameters of the 
gth equation in their natural order. α here serves as a general symbol, 
like x, for all parameters, β or γ. Examples: 

    α*₁ = vec (1,γ₁₁,γ₁₂,γ₁₃,γ₁₄) 
    β*₁ = vec (1) 
    γ*₁ = vec (γ₁₁,γ₁₂,γ₁₃,γ₁₄) 
    γ*₂ = vec (γ₂₁,γ₂₂,γ₂₃) 
    α*₃ = vec (β₃₁,1,γ₃₁,γ₃₂) 
    γ*₃ = vec (γ₃₁,γ₃₂) 

In Chap. 8, I place stars (or pairs of stars) not on vectors but on the 
variables themselves to emphasize their presence in (or absence from) 
an equation. For instance, in discussing the third equation above, 
we may write y₁*, y₂**, y₃*, z₁*, z₂*, z₃**, z₄** to stress that y₁, y₃, z₁, z₂ do 
appear in the third equation whereas the other variables y₂, z₃, z₄ 
do not. 

Finally, A**_g means the matrix that can be formed from the elements 
of A by taking only the columns of A that correspond to the variables 
x** absent from the gth equation. For example, with A = [B Γ] for 
the model above, 

    A**₃ = [0    γ₁₃   γ₁₄] 
           [1    γ₂₃   0  ] 
           [0    0     0  ] 

the columns of A**₃ corresponding to x**₃ = vec (y₂,z₃,z₄). 
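The column-selection rule is mechanical. In the sketch below the numeric parameter values are placeholders (only the zero pattern and the positions of the 1's matter); the columns of A = [B Γ] are ordered (y₁, y₂, y₃, z₁, z₂, z₃, z₄) as in the three-equation model above:

```python
import numpy as np

g11, g12, g13, g14 = 1.1, 1.2, 1.3, 1.4     # placeholder values for the gammas
g21, g22, g23 = 2.1, 2.2, 2.3
g31, g32 = 3.1, 3.2
b21, b23, b31 = 0.21, 0.23, 0.31            # placeholder values for the betas

A = np.array([
    [1.0,  0.0, 0.0,  g11, g12, g13, g14],  # equation (1)
    [b21,  1.0, b23,  g21, g22, g23, 0.0],  # equation (2)
    [b31,  0.0, 1.0,  g31, g32, 0.0, 0.0],  # equation (3)
])

cols = ["y1", "y2", "y3", "z1", "z2", "z3", "z4"]
absent_from_eq3 = ["y2", "z3", "z4"]        # x3** = vec(y2, z3, z4)
A3_star_star = A[:, [cols.index(c) for c in absent_from_eq3]]

# The y2 column carries the 1 from equation (2); equation (3)'s own row,
# by construction, contains only zeros in A3**.
assert A3_star_star.shape == (3, 3)
assert np.all(A3_star_star[2] == 0.0)
```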






Aggregation, 5 
Allen, R. G. D., xvii, 51 
Assumptions, additivity, error term, 5, 

Simplifying, 9-17 

statistical, 4 

stochastic, 4, 6, 95 

(See also Error term) 
Autocorrelation (see Serial correlation) 
Autoregression, 52-53, 61, 168, 195 

of the economy, over-all, 185-186, 

Bartlett, M. S., 51 

Beach, E. F., xvii, 21, 146, 155 

Bennion, E. G., 134 

Bias, 35-39 
instrumental variables, 114 
least squares, 38, 47, 68, 72 

in models of decay, 52-62, 209-212 
secular consumption function, 71 
simultaneous interdependence, 65-68 
(See also Unbiasedness) 

Bogus relations, 89-90 

Bronfenbrenner, Jean, 72 

Bunch map analysis, 146-150, 155 

Burns, A. F., 195 

Business cycle model, 171-174, 183-184 

Causality, 63-64, 106, 108, 112-113, 

Causation, chain of, 119-121 

Central limit theorem, 14 

Characteristic vector (eigenvector), 124 

Christ, C. F., 21, 134 

Cobweb phenomena, 16 

Cochrane, D., 195 

Completeness, 86, 106, 213 
(See also Nonsingularity) 

Computations, 32, 197-202 

Conflicting observations, 109 

Confluence, 103-106, 155 
linear, 142-144 

Consistency, 24, 36, 43-46, 51, 118 
in instrumental variables, 114 
in least squares, 44, 47-49, 195 
in limited information, 118 
and maximum likelihood, 44, 47-49 



Consistency, notation for, 44 

in reduced form, 100, 103 

in Theil's method, 118, 128 
Consumption function, cross section 
versus time series, 192-196 

examples, 2, 63-72, 137-139 

secular, bias in, 71 
Correcting denominators, 180-181 
Correlation, 141-142 

partial, 144-145 

serial, 168-176, 195 

spurious, 150 
Correlation coefficient, matrix, 145 

partial, 144-145 

sample, 141-142, 155 
computation of, 200-201 

universe, 141-142, 155 
Correlograms, 174-175, 195 
Cost function, 21-22 
Counting rules for identification, 92-96, 

Courant, R., 84 
Covariance, 19, 27, 82-83, 170, 207-208 

error term, 30, 49 

population and sample, 141 
Cramer's rule, 35, 203 
Credence, 24, 43-46, 110 
Cross-section versus time-series esti- 
mates, 192-196 

(See also Consumption function) 
Cyclical form, 121 

(See also Recursive models) 
Cyclical indicators, 182 

(See also Business cycle model) 

David, F. N., 195 
Dean, J., 22 

Decay, models of, initial conditions in, 
53, 58-61 
least squares bias in, 52-62, 171, 

unbiased estimation in, 60-61 
Degrees of freedom, 3 
Determinants, 34-35 
Cramer's rule, 35 
Jacobians, 29, 73-74, 79-82 
Diffusion index, 182, 188-190, 196 
and statistical coverage, 191-192 


Discontinuity, of hypotheses, 137-138 

probability, 9-10 
Disturbances (see Error term) 
Dummy variables, 140, 156-157, 165 

Efficiency, 24, 47, 51, 118 
in instrumental variables, 118 
in least squares, 47-49 

and heteroskedasticity, 48 
in limited information, 118 
and maximum likelihood, 47-49 
in Theil's method, 118 
Eigenvector, 124 
Elasticities, price, and Haavelmo's 

proposition, 72 
Equations, autorcgrcssivc, 52-53 
simultaneous, 67-71, 73-84 

notation, 74-76 
structural, bogus, 89-90 
Error term, 4, 8, 9-18 

additivity assumption, 5, 18, 22 
covariance matrix, 30, 49 
Simplifying Assumptions, constancy 
of variance (no. 3), 13, 17, 77 
(See also Heteroskedasticity) 
normally distributed (no. 4), 14, 17, 

random real variable (no. 1), 9, 17, 

serial independence of (no. 5), 16, 

uncorrelated, in multi-equation 
models (no. 7), 78-79, 87, 
with predetermined variable (no. 
6), 16, 18, 33, 52, 53, 65n. 
zero expected value of (no. 2), 10, 
17, 77 
Errors, of econometric relationships, 
of measurement, 6, 48 
in variables, 155 
Estimate, variance of, 39-43 
Estimates in multi-equation models, in- 
terdependence of, 82 
Estimating criteria, 6, 8, 23, 47 
(See also Consistency; Maximum 
likelihood; Unbiasedness) 


Estimation, 1-2, 8, 108-110 
simultaneous, 67-68, 126-135 
unbiased, in models of decay, 60-61 

Estimators, extraneous, 134 
subsample variances as, 207-208 

Expectation, 18-20 

Expected value, 10-11, 19-21 

Factor analysis, linear orthogonal, 160- 

unspecified, 156-165 

versus variance analysis, 164-165 
Fisher, R. A., 51 
Forecasts, criteria for, 132-134 

leading indicators as, 186-188 
Fox, K. A., 21, 134 
Friedman, M., 72, 134 
Frisch, R., 155 

Gartaganis, A. J., 195 
Gini, C., 51 

Goldberger, A. S., x, 21 
Goulden, C. H., 155 

Haavelmo, T., 71, 106, 155, 196 
Haavelmo proposition, 64-66, 71-72, 

Heteroskedasticity, 48-49 

efficiency, 48 

(See also Error term, Simplifying 
Assumptions, no. 3) 
Hogben, L., 12n., 46, 51 
Homoskedastic equation, 50

(See also Heteroskedasticity)
Hood, W. C., xvii, 21, 85, 125
Hurwicz, L., 22, 53, 62 
Hypotheses, choice of, 154-155 

discontinuous, 137-138

maintained, 130-137 

null, 138 

questioned, 137 

testing of, 136-155 

Identification, 3-4, 64, 85-106 
counting rules, 92-96, 102 


Identification, exact, 88, 90-91, 94, 107
absence of, 91, 128-131 
of parameters in underidentified

equation, 96-97, 128 
(See also Overidentification; Underidentification)
Improvement, 44 
Incomplete theory, 5 
Independence of simultaneous equa- 
tions, stochastic, 78-79, 86-87, 213
Initial conditions in models of decay, 53, 

Instrumental variable technique, 107- 
properties, 114 

related, to limited information, 118, 
to reduced form, 113 
Instrumental variables, efficiency in, 
weighted, 116-117 
Interdependence, simultaneous, 63-72 

Jacobians, digression on, 79-82 

of likelihood function, 29 

references, 84 

and simultaneous equations, 73-74 
Jaffé, W., x
Jeffreys, H., 51 

Kaplan, W., 84 

Kendall, M. G., xvii, 21, 50, 51, 165, 

173n., 195 
Keynes, J. M., 51 
Klein, L. R., xvii, 21, 51, 84, 106, 124n., 

125, 134, 155, 201 
Koizumi, S., 21 
Koopmans, T. C., xvii, 22, 72, 84, 85, 

94n., 106, 155, 195 
Kuh, E.,134 

Lacunes (missing data), 159n. 
Lagged model, 53-54, 62 

(See also Recursive models) 
Lange, O., 22



Leading indicators, 186-188 
Least squares, 23-50 

bias, in models of decay, 52-62, 209- 

consistency, 44, 47-49, 195 
diagonal, and simultaneous estima- 
tion, 67-68
directional, 69, 125 
efficiency, 47-49 

and heteroskedasticity, 48
generalized, 33-35 
Haavelmo bias, 65-67
justification, 31-32, 134-135 
maximum likelihood, 24, 31-34, 47- 

49, 57 
naive, compared to maximum likeli- 
hood, 67, 82-83 
reduced form, 88, 98, 113, 127, 

references, 50-51, 134-135 
related to instrumental variables, 

relation to limited information, 125 
simultaneous estimation, 67-70 
stepwise, 203-206 
sufficiency, 47-49 
unbiasedness, 35-39, 47-49, 60, 65- 

used in estimating reduced form, 
Leser, C. E. V., 157
Likelihood, 24-25, 50 
Likelihood function, 23, 28-31 
and identification, 90-91, 100-102 
(See also Maximum likelihood) 
Limited information, 4, 118-125, 128 
consistency in, 118 
efficiency in, 118 
formulas for, 123 
relation of, to indirect least squares, 

to instrumental variables, 118, 125 
Linear confluence, 142-144 
Linear models, multi-equation, nota- 
tion, 74-77 
versus ratio models, 152-153 
Linearity, testing for, 150-152 
Liu, T. C., 130-132

Marginal propensity to consume (see 

Consumption function) 
Markoff theorem, 195
Marschak, J., 21 
Matrix, 3, 27 

coefficients, 75 

inversion, 201-202 

moments, 31, 124, 198-201 

nonsingularity of, 86-87

orthogonality, 160-164

rank and identification, 93-94 

triangular (recursive models), 83 
Maximum likelihood, 23, 25, 29-33, 50, 

computation, 83 

consistency, 44-48 

efficiency, 47-49 

full information, 73-84, 118, 129, 133 

identification, 100-103 

interdependence, 82-83 
simultaneous, 63-71 

limited information, 118-125 

subsample variances, 207 

value for forecasting, 132-133 
Mean, arithmetic, expected value, 10-11 
Measurement, errors of, 6, 48 
Meyer, J. R., 134
Miller, H. L., Jr., 134
Mitchell, W. C., 195
Model, 1 

business cycle, 171-174, 183-184 

lagged, 53-54, 62 

latent, 110-112 

manifest, 110-112 

recursive, 83-84 

supply-demand, 89 

(See also Linear models) 
Model specification (see Specification) 
Moments, 3, 20, 32-35 

algebra of, 51 

computation of, 32-33, 197-200 

determinants of, 34-35 

expectation of, 20 

matrices of, 34, 124 

raw, 198 

from sample means, 18-20

simple (augmented), 20, 198 
Moore, G. H., 134, 187n., 196
Moving average technique, 61, 168 


Moving average technique, Slutsky 

proposition, 173-174 
Multicollinearity, 3, 103-106 
etymology, 105-106 
(See also Confluence; Underidentification)

National Bureau of Economic Research, 

181-182, 186-188, 195 
Nature's Urn, 11-13, 168 

subjective belief and, 11-12 
Neyman, J., 195
Nonsingularity, 86-87 

(See also Completeness) 
Normal distribution, multivariate, 26- 

univariate, 14-15

Observations, 7 

conflicting, 109 

(See also Sample) 
Orcutt, G. H., 72, 185, 195 
Original form, 87-88 

(See also Reduced form) 
Orthogonality, 160-164 

test of, 163-164 

and variance analysis, 165 
Oscillations, 53, 172-174 
Overdeterminacy, 88-89

(See also Overidentification)
Overidentification, 86, 89, 98-103, 129

contrasted with underidentification, 
Overidentified models, ambiguity in, 98 

P lim (probability limit), 44 
Parameter estimates, constraints on, 

Parameter space and identification, 

Parameters, 2, 8 

a priori constraints on, 91-94 
Pareto index, 194 
Pearson, K., 51 
Periodograms, 175-176, 195 
Population, 7, 11-12, 141 


Prediction, 2 

(See also Forecasts; Leading indica- 
Probability, 3, 9-10, 24 

density, 9 

inverse, and maximum likelihood, 51 
Probability limit (P lim), 44 
Production function, 21-22 

Cobb-Douglas, 157-159 

Quandt, R. E., x 

Random term (see Error term) 

Random variable, 9 

Rank, 93-94 

Ratio models versus linear models, 152-153

Recursive models, 83-84, 121 
Reduced form, 87-88 

bias, 100-103 

dual, Theil's method, 126-128

and instrumental variable technique, 

least-squares forecasts, 133 
Residual contrasted with error, 8 

Sample, 6-7, 11-12, 141-142 

and consistency. 43-46 

size, 53 

and unbiasedness, 35-43 

variance, 39-43 
Samples, conjugate, 52, 54-57 
Sampling procedure, 6-21 
Scaling of variables, 198 
Schultz, H., 22, 90n.
Seasonal variation, 176-178 

removal of, 182 
Sectors, independence and nonsingular- 
ity, 87 

split, versus sector variable, 140, 153- 
Serial correlation, 168-171, 175, 195 

in Simplifying Assumptions, 16, 18, 
Sets, conjugate, 55-57 

of measure zero, 111 



Shiskin, J., 182n.
Simon, H. A., 106 
Simplifying Assumptions, 9-18 
generalization to many-equation 

models, 77-78 
mathematical statement, 17-18 
sufficient for least squares, 32n. 
(See also Error term)
Simultaneous estimation, 67-68, 126-135

Slutsky, E., 173-175, 195 
Solow, R. M., 165 
Specification, 1 
and identification, 85 
imperfect, 5 
stochastic, 6 
structural, 6 
Statistical assumptions, 4 

(See also Stochastic assumptions) 
Stephan, F. F., x 
Stochastic assumptions, 4, 6, 95 
(See also Error term, Simplifying Assumptions)
Stochastic independence, 78-79, 87, 213 
Subjective belief, 50 

and Nature's Urn, 11-12 
Subsample variances as unbiased esti-
mators, 207-208
Sufficiency, 47, 51 
Suits, D. B., 21 
Supply-demand model, 89 
Symbols, special, bogus relations, super- 
script ©, 89 
estimates, maximum likelihood, as in â (hat), 8, 67
naive least squares, as in ă (bird), 8, 67
other kinds, unspecified, as in ã (wiggle), 8
matrices, boldface type, 33 
variables, absent from given equa- 
tion, superscript (**), 91 
present in given equation, super- 
script (*), 91 
Symmetrical distribution, 61 

Testing of hypotheses, 136-155 
Theil, H., 116, 126, 134 

Theil's method, 126-128
consistency in, 118, 128 
efficiency in, 118 

stepwise least-squares computation, 
Time series, 166-196 
cyclical, 172 
generated by structural models, 183

long-term, abuse of, 190 
National Bureau analysis, 181-182, 

186-188, 195 
random series and Slutsky proposi- 
tion, 173 
Time-series estimates versus cross-sec- 
tion estimates, 192-195 
Timing indicator, 186-188 
Tinbergen, J., xvii, 21, 134, 155, 185, 

Tintner, G., xvii, 21, 51, 62, 103 
Trend removal, 178-181 

Unbiasedness, 24, 35-37, 45-46, 51, 118 

in instrumental variables, 114 

in least squares, 35-39, 47-49, 60, 65- 

in limited information, 118 

reasons for, 53 

in reduced form, 100, 103 

in subsample variance estimators, 

in Theil's method, 128

(See also Bias) 
Uncertainty Economics, 45 
Underdeterminacy, 88-89 
Underidentification, 86, 88-91, 96, 100-
106, 128 

of all structural relationships, 130 
Universe, 11-12, 141-142 
Unspecified factors, 156-165 

(See also Sectors, split) 
Urn of Nature (see Nature's Urn) 

Valavanis, S., 22, 196
Variables, endogenous, 2, 64 

errors in (see Errors in variables)

exogenous, 2, 64 



Variables, independent (exogenous), 2
Watts, H. W., 165

scaling, 198 
standardized, 145 
Variance, 19, 40-41 
covariance, 27-31, 82-83, 140-141, 

of estimate, 39-43 
estimates of, 41 
infinite, 14 

subsample as estimator, 207-208 
Variance analysis and factor analysis, 

164-165
Variate difference method, 21, 179-181 
Verification, 2, 136-155 

Weights, arbitrary, 50 

instrumental variables, 109-111, 116-117

and least squares, 109-110, 124 
Wilson, E. B., 195 
Wise, J., 195 
Wold, H., 135 
Working, E. J., 106

Yule, G. U., 51 
Zero restriction, 92 


